PREFACE


This  report  examines  selected  issues  regarding  the  measurement  and 
calculation  of  the  resource-based  relative  values  for  physician  work  to 
be  used  in  the  Medicare  Fee  Schedule.  Of  particular  interest  are  is- 
sues central  to  the  obtaining  of  survey  data  from  physicians  and  cal- 
culations used  to  transform  those  data  to  work  values.  Different 
methods  of  data  collection  and  analysis  have  consequences  in  terms  of 
the  distribution  of  Medicare  payments  for  different  services,  which 
must  be  taken  into  consideration  in  revising  the  system  as  it  evolves 
into  a  "steady  state."  Only  by  understanding  these  consequences  can 
the  Health  Care  Financing  Administration  (HCFA)  maintain  a 
steady-state  Medicare  Fee  Schedule  that  provides  a  stable  policy  for 
the  medical  community  and  the  public  yet  is  responsive  to  changes  in 
the  economics  and  technology  of  medical  practice.  The  results  re- 
ported here  should  be  useful  to  HCFA  in  choosing  a  general  method 
for  adjusting  work  values  as  well  as  in  making  other  medical  policy 
choices. 

The analyses reported here were performed within the RAND/UCLA/
Harvard Center for Health Care Financing Policy Research.


SUMMARY 


PHYSICIAN  WORK  VALUES  FOR  THE  MEDICARE 
FEE  SCHEDULE 

The  Medicare  Fee  Schedule  (MFS),  which  took  effect  as  mandated  by 
Congress  in  January  1992,  replaces  the  system  of  customary,  prevail- 
ing, and  reasonable  charges  used  by  the  Health  Care  Financing 
Administration (HCFA) for physician payment with a system of payment
based on relative value units (RVUs). The RVU for a physician's
service comprises three elements: (1) the relative value for
physician  work  (RVW);  (2)  practice  expenses,  i.e.,  overhead,  excluding 
malpractice  expenses;  and  (3)  malpractice  expenses.  Under  the  MFS, 
payment  for  a  service  involves  adjusting  each  of  these  three  elements 
of  the  RVU  by  a  separate  geographical  factor,  then  summing  to 
produce  a  value  for  the  service.  This  RVU  is  then  multiplied  by  a 
national  conversion  factor  to  yield  a  dollar  amount  for  payment. 
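
The payment calculation just described can be sketched in a few lines of Python. This is a minimal illustration only, not HCFA's implementation; the function name, the argument names, and every numeric value are hypothetical.

```python
def mfs_payment(rvw, practice_rvu, malpractice_rvu,
                geo_work, geo_practice, geo_malpractice,
                conversion_factor):
    """Return the dollar payment for one service (illustrative only)."""
    # Adjust each of the three RVU elements by its own geographic factor,
    # then sum the adjusted elements into a single RVU for the service.
    adjusted_rvu = (rvw * geo_work
                    + practice_rvu * geo_practice
                    + malpractice_rvu * geo_malpractice)
    # One national conversion factor turns the RVU into dollars.
    return adjusted_rvu * conversion_factor


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
payment = mfs_payment(rvw=1.0, practice_rvu=0.5, malpractice_rvu=0.1,
                      geo_work=1.0, geo_practice=1.0, geo_malpractice=1.0,
                      conversion_factor=30.0)
print(round(payment, 2))
```

Because the geographic factors are applied before summing, a change in one factor affects only its own element of the RVU.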

The  major  innovation  of  the  MFS — and  the  focus  of  this  report — is  the 
measurement  of  RVWs.  Estimates  of  physician  work  under  the  new 
fee  schedule  are  based  on  the  Resource-Based  Relative  Value  Scale 
(RBRVS).  That  scale  was  constructed  by  a  team  of  researchers  at  the 
Harvard  School  of  Public  Health,  headed  by  William  C.  Hsiao,  Ph.D. 
As  the  RBRVS  has  evolved,  it  has  been  extensively  commented  upon 
and  criticized  and  numerous  revisions  have  been  recommended. 
However,  the  estimates  of  physician  work  contained  in  the  MFS  rules 
issued  to  take  effect  in  January  1992  are  based  primarily  on  the 
Harvard  study. 

The  focus  of  the  project  reported  on  here  was  to  examine  alternative 
ways  to:  (1)  obtain  survey  data  on  the  amount  of  work  performed  by 
physicians  in  different  specialties,  and  (2)  transform  those  data  to  a 
common  scale  of  relative  work  values.  The  work  reported  on  here  was 
largely conducted in 1991, before any public release of the results of
Phase III of the Harvard study or the HCFA final regulations.

STEPS  TO  DEVELOPING  RELATIVE  WORK  VALUES 

The  development  of  the  RBRVS  has  involved  a  five-step  process:  (1) 
obtaining  raw  survey  data  on  physician  work  separately  for  each 
"major"  specialty,  (2)  fitting  data  from  each  specialty  onto  a  common 
relative  value  scale,  (3)  calculating  total  work  based  on  estimates  of 
pre- and post-service work, (4) mapping work values for surveyed
services into Current Procedural Terminology (CPT) codes used for
payments, and (5) extrapolating work values from surveyed services to
nonsurveyed  services.  Although  the  final  step  has  become  moot  as 
virtually  all  services  have  been  surveyed,  the  remaining  four  merit 
discussion. 

Step  1:  Obtain  Specialty-Specific  Work  Values 

The  first  step  of  the  RBRVS  is  to  obtain  RVWs  for  different  services 
performed  by  physicians.  From  the  very  beginning,  the  consensus 
has  been  that  the  best  way  to  do  this  is  to  ask  physicians;  the  issue 
has  been  how  to  ask  the  question.  The  Harvard  study  adopted  the 
principle that the basic physician effort was "intra-service" work, or
the  work  involved  in  the  main  part  of  delivery  of  a  service. 

Magnitude  Estimation.  Intra-service  work  has  been  directly  as- 
sessed by  "magnitude  estimation,"  where  physician-respondents  rate 
the  work  of  a  service  as  a  multiple  or  fraction  of  the  work  involved  in 
a  standard  service.  For  example,  surgeons  compared  procedures  to 
the  work  required  to  perform  an  uncomplicated  indirect  inguinal  her- 
nia repair,  which  was  defined  as  having  a  work  value  of  100.  If  a  par- 
ticular procedure  required  half  the  work  of  the  standard,  it  was  rated 
50;  if  it  required  three  times  as  much  work,  it  was  rated  300. 
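
The rating rule can be illustrated with a short Python sketch. The conversion from a ratio to a rating follows the text directly; the pooling of several respondents' ratings by a geometric mean is an assumption made here for illustration (a common choice for ratio judgments) and is not stated in this summary. All names are hypothetical.

```python
from statistics import geometric_mean

STANDARD_VALUE = 100  # work value assigned to the standard service


def rating_from_ratio(ratio):
    """Rating for a service judged to need `ratio` times the standard's work."""
    return STANDARD_VALUE * ratio


def pooled_rating(ratings):
    """Pool several respondents' ratings.

    Assumption: a geometric mean is used, so that a rating of half the
    standard and a rating of twice the standard balance each other out.
    """
    return geometric_mean(ratings)


half_the_work = rating_from_ratio(0.5)     # half the work of the standard
three_times = rating_from_ratio(3)         # three times the work
pooled = pooled_rating([50, 200])          # two hypothetical respondents
```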

Considering  all  the  alternatives,  the  claims  for  the  validity  of  magni- 
tude estimation  appear  convincing.  The  RVWs  derived  from  magni- 
tude estimation  all  show  the  independent  influences  of  four  separate 
components  of  work — time,  cognitive  effort,  physical  effort,  and 
stress,  in  ways  that  understandably  differ  across  different  medical 
specialties. 

Specialty-Specific  Telephone  Surveys.  In  both  Phases  I  and  II, 
survey  data  were  obtained  from  physicians  through  specialty-specific 
telephone  surveys  of  nationally  representative  samples.  In  Phase  II, 
a  side-study  explored  other  possible  ways  of  obtaining  data.  It  con- 
cluded that  introducing  interaction  among  panelists  by  employing 
groups  led  to  results  diverging  from  the  so-called  "gold  standard"  of 
the  telephone  survey  results.  Nonetheless,  Phase  III  was  supposed  to 
use  a  small-group  process.  Section  3  of  this  report  examines  the  rel- 
ative merits  of  alternative  survey  methods  for  obtaining  RVW  esti- 
mates and  the  costs  of  different  survey  techniques. 

Rating  Vignettes.  Each  specialty  was  asked  to  provide  estimates  of 
intra-service  work  for  about  23  services,  presented  as  "vignettes." 
Each vignette was a brief description of a patient and the service
provided. A number of concerns have been raised about the use of these
vignettes, including who provided the ratings, individual differences
in  perception  of  the  work  involved  in  performing  the  vignettes,  and 
the  representativeness  of  the  vignettes  as  exemplars  of  CPT  codes. 
Each  concern  has  been  addressed  in  criticisms  of  the  Harvard  study 
and  in  research  done  by  Abt  Associates,  the  Physician  Payment  Re- 
view Commission,  and  others.  The  result  has  been  minor  modifica- 
tions to  the  process  in  later  phases  of  the  Harvard  study. 


Step  2:  Fit  Work  Values  into  a  Common  Scale 

Work  values  have  been  obtained  for  different  medical  specialties,  at 
different  times,  and  using  different  "standard"  services  as  the  basis 
for  magnitude  estimation.  Variation  in  any  of  these  factors  means 
that  the  obtained  values  are  not  commensurate;  "linkage"  procedures 
are  necessary  to  fit  measurements  to  a  common  scale.  This  calibra- 
tion has  been  done  by  declaring  services  from  different  specialties  as 
having  the  same  amount  of  work  and  employing  least-squares  regres- 
sion procedures  to  produce  a  set  of  adjustments  to  transform  spe- 
cialty-specific survey  values  to  a  common  scale.  We  discuss  this  link- 
age process  in  detail  below  and  in  Section  4. 
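
As a rough illustration of the idea, the following Python sketch aligns one specialty's scale to a reference scale using a single multiplicative adjustment. It is a simplified sketch under stated assumptions, not the Harvard procedure: it handles only two scales, it assumes the adjustment is a constant factor, and it fits that factor by least squares on the logarithms of the linked values. The function names and data are hypothetical.

```python
import math


def linkage_factor(links):
    """Least-squares multiplicative adjustment for one specialty scale.

    `links` is a list of (reference_value, specialty_value) pairs for
    services declared to involve the same amount of work. Minimizing
    sum((log(ref) - log(spec) - a) ** 2) over a gives the mean log ratio,
    so the fitted factor is exp(a).
    """
    log_ratios = [math.log(ref / spec) for ref, spec in links]
    a = sum(log_ratios) / len(log_ratios)
    return math.exp(a)


def rescale(values, factor):
    """Move every service on the specialty scale by the same factor."""
    return {service: value * factor for service, value in values.items()}


# Hypothetical linked pairs and a hypothetical specialty scale.
factor = linkage_factor([(200.0, 100.0), (400.0, 200.0)])
common = rescale({"service_a": 100.0, "service_b": 150.0}, factor)
```

Because every value on the specialty scale is multiplied by the same factor, the ratios among services within that specialty are unchanged by the adjustment.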


Step  3:  Calculate  Total  Work  Using  Estimates  of  Pre-  and 
Post-Service  Work 

The  magnitude  estimates  of  intra-service  work  are  converted  to  a  to- 
tal amount  of  work  by  adding  pre-service  and  post-service  effort.  In 
Phase  I,  this  was  done  by  first  obtaining  estimates  of  pre-  and  post- 
service  work  for  a  sample  of  vignettes  and  then  using  regression 
analyses  to  extrapolate  to  the  other  surveyed  services. 
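
A minimal Python sketch of this extrapolation follows. It assumes, purely for illustration, that combined pre- and post-service work is predicted from intra-service work by a straight line; the predictors actually used in Phase I are not given in this summary, and the sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def total_work(intra, intercept, slope):
    """Intra-service work plus the predicted pre- and post-service work."""
    predicted_pre_post = intercept + slope * intra
    return intra + predicted_pre_post


# Hypothetical sample of vignettes with directly estimated pre/post work.
sample_intra = [50.0, 100.0, 200.0]
sample_pre_post = [20.0, 35.0, 65.0]
intercept, slope = fit_line(sample_intra, sample_pre_post)

# Extrapolate to a surveyed service that was not in the sample.
estimate = total_work(150.0, intercept, slope)
```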

The  establishment  of  total  work  from  intra-service  work  was  subject 
to  much  criticism,  largely  centering  on  the  definition  of  what  consti- 
tuted pre-  and  post-service  work  and  how  the  various  components  of 
this  work  should  be  structured  for  separate  assessment.  In  response 
to  these  criticisms,  Harvard's  Phase  II  developed  refined  estimates  of 
pre-  and  post-service  work  by  defining  the  work  involved  in  those 
time  periods  more  precisely. 


Step  4:  Map  Work  Values  for  Translating  Vignettes  into 
CPT  Codes 

After  linkage  and  determination  of  total  work  values  for  all  the  vi- 
gnettes, the  work  values  must  be  assigned  to  billing  codes.  Where 
there was a one-to-one mapping between a single vignette and a single
code, this was a straightforward task. However, the assignment
was  not  always  straightforward  because:  (1)  the  translation  from  vi- 
gnette to  the  appropriate  code  was  not  subject  to  an  unambiguous  set 
of  rules,  (2)  some  vignettes  from  the  same  specialty  were  assigned  to 
the  same  code,  and  (3)  some  vignettes  from  different  specialties  were 
assigned  to  the  same  code. 

Ambiguous  cases  were  decided  by  various  panels  convened  for  the 
purpose,  so  that,  eventually,  most  vignettes  were  assigned  to  one 
code.  For  services  from  a  common  specialty  that  shared  a  code,  the 
RVW  was  calculated  as  the  arithmetic  mean  of  the  vignette  work 
values.  For  services  from  different  specialties  that  shared  a  code,  the 
common  RVW  was  calculated  as  the  volume-weighted  average  (using 
Medicare  Part  B  data)  of  work  values  from  the  realigned  specialty- 
specific  scales. 
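
The two averaging rules can be written out as a short Python sketch. The function names and the numbers are hypothetical; the volumes stand in for Medicare Part B service counts.

```python
def within_specialty_rvw(vignette_values):
    """Arithmetic mean of vignette work values sharing a code in one specialty."""
    return sum(vignette_values) / len(vignette_values)


def cross_specialty_rvw(values_and_volumes):
    """Volume-weighted average of work values from different specialties.

    `values_and_volumes` is a list of (work_value, volume) pairs, one per
    specialty, with work values already on the realigned common scale.
    """
    total_volume = sum(volume for _, volume in values_and_volumes)
    weighted_sum = sum(value * volume for value, volume in values_and_volumes)
    return weighted_sum / total_volume


same_specialty = within_specialty_rvw([10.0, 14.0])
shared_code = cross_specialty_rvw([(10.0, 300), (20.0, 100)])
```

Under the volume-weighted rule, the specialty that bills a code more often pulls the common RVW toward its own value.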

One  type  of  billing  code  common  to  virtually  all  specialties  is  evalua- 
tion and  management  (EM)  (originally  numbered  90000  through 
90699  in  the  CPT  coding  system).  As  the  RBRVS  process  evolved,  it 
became  clear  that  the  EM  codes  were  not  adequate.  These  codes  were 
replaced,  beginning  in  1992,  by  a  new  set  of  codes  (numbered  99200 
through  99499),  whose  work  values  were  estimated  by  a  panel  of 
Medicare  Carrier  Medical  Directors  convened  for  that  purpose.  The 
panel  used  values  for  the  original  EM  codes  as  a  starting  point  and 
translated  work  values  to  the  new  codes  when  they  believed  it  appro- 
priate. 

OBTAINING  MAGNITUDE  ESTIMATES  FROM  PHYSICIANS 

To  obtain  magnitude  estimation  data  for  RVWs,  one  needs  to  consider 
who  should  provide  estimates  of  work  values,  alternative  methods  for 
surveying  physicians,  and  the  cost  of  obtaining  data. 

Who  Should  Estimate  the  Magnitude  of  Work? 

In  deciding  who  will  estimate  work,  the  data  collector  must  ensure 
that  respondents  are  qualified  to  respond  and  that  the  data-collection 
process  is  not  subject  to  "gaming"  or  other  biases.  The  Harvard  study 
randomly  sampled  physicians  from  the  AMA  Masterfile,  a  process 
that  was  questioned  by  some,  although  not  on  the  basis  of  any  hard 
evidence.  Several  critics  have  argued  that  respondents  should  be 
drawn  by  specialty  societies,  but  that  course  of  action  runs  a  risk  of 
conflict  of  interest.  A  resolution  of  the  different  positions  can  be 
found  by  employing  a  Universal  Provider  Identification  (UPIN) 
database  about  to  begin  at  HCFA.  This  new  database  can  be  used  as 
the source for physicians performing more than a specified minimum
number  of  services  within  the  set  to  be  surveyed. 

Group-Based  Methods  for  Obtaining  Work  Values 

As part of Phase II, the Harvard study investigated the possibility of
using  small  group  process  methods  for  estimating  work  values  in- 
stead of  the  telephone  survey  approach  used  in  Phase  I.  We  suggest, 
contrary  to  the  conclusions  of  the  Phase  II  final  report,  that  the  data 
argue  for  the  future  use  of  some  small  group  process,  either  Delphi  or 
face-to-face,  for  generating  physicians'  work  values. 

Our  opinion  is  buttressed  by  our  examination  of  the  recent  social  psy- 
chological literature  of  empirical  individual  and  group-based  judg- 
ment and  decisionmaking.  We  investigated  whether  tasks  similar  to 
assessing  relative  work  values  were  better  suited  to  collective  indi- 
vidual or  group  methods.  Our  search  yielded  23  published  studies, 
looking  at  both  intellective  (where  a  correct  answer  can  be  deter- 
mined) and  judgmental  (where  there  is  no  a  priori  correct  answer)  de- 
cision tasks.  We  view  estimating  physician  work  as  midway  on  the 
intellective-judgmental  continuum.  The  studies  were  consistent  in 
showing  that,  strongly  for  intellective  tasks  and  moderately  for  judg- 
mental tasks,  small-group  processing  produces  greater  accuracy  and 
more  hypothesis  evaluation  than  individual  processing.  For  both 
types  of  tasks,  a  group  advantage  occurs  without  a  significant  degra- 
dation of  output  caused  by  differences  in  member  ability  and  status. 
This  result  suggests  that  future  estimates  of  RVWs  should  be  ob- 
tained by  some  form  of  group  method  that  permits  interaction  and 
feedback  among  respondents. 

The  Costs  of  Different  Methods  of  Data  Collection 

To  better  decide  which  method  of  obtaining  work  value  data  to  choose, 
we  developed  cost  estimates  for  collecting  such  data  for  four  data-col- 
lection methods:  (1)  interviewer-administered,  one-round  telephone 
survey;  (2)  one-round  mail  survey;  (3)  two-round  mail  survey  (Delphi); 
and  (4)  one-round  mail  survey  with  a  group  discussion  follow-up. 

The  cost  estimates  were  for  a  hypothetical  revision  of  the  RVW  that 
would  require  obtaining  assessments  for  600  services — 50  services 
each  from  12  different  panels  for  the  first  three  methods  and  200  ser- 
vices from  each  of  three  panels  for  the  discussion  method.  Sample 
sizes  for  the  methods  were  chosen  to  produce  approximately  equal  be- 
tween-physician  standard  deviations  of  intra-service  work  values. 


The  estimated  costs  showed  that  the  telephone  survey  was  the  most 
expensive  method,  at  $105,000.  The  single-round  mail  survey  was 
the  least  expensive,  costing  $65,500.  The  two  group  methods  were 
approximately  equal  in  cost,  with  the  discussion  method  ($88,000) 
costing  10  percent  more  than  the  mail  survey  plus  panel  ($80,000). 
The  smaller  sample  size  of  the  discussion  method  was  offset  by  the 
travel  costs  to  convene  the  groups.  Based  on  cost,  there  is  no  reason 
to  choose  one  group  method  over  the  other,  whereas  if  an  individual- 
ratings  method  is  chosen,  the  mail  survey  has  clear  advantages  over 
the  telephone  survey. 

LINKAGE 

In  Phase  II  of  the  Harvard  study,  275  links  were  used  to  align  the 
specialty  surveys  from  both  Phase  I  and  Phase  II  onto  a  common 
scale.  In  producing  the  common  scale,  the  source  individual  spe- 
cialty-specific scales  are  not  rescaled  internally,  so  that  the  work 
value  relationships  within  each  scale  stay  the  same  after  adjustment. 
For  the  present  project,  we  attempted  to  replicate  the  Harvard  link- 
age procedure  and  developed  an  alternative  linkage  procedure  based 
on  a  different  set  of  assumptions  than  that  used  by  Harvard. 

Replicating  the  Harvard  Linkage  Procedure 

We  attempted  to  replicate  the  Harvard  linkage  procedure.  Because 
only  the  means  and  standard  deviations  of  services  over  physicians 
were  available  to  us,  we  could  not  use,  much  less  validate,  the  estima- 
tion-maximization averaging  of  the  physician-level  data  used  in  the 
Harvard  study.  In  this  replication  process,  we  found  some  problems: 

•  One  specialty,  ophthalmology  in  Phase  I,  did  not  have  estimated 
standard  errors;  we  instead  used  the  average  standard  error  for  all 
ophthalmological  services  in  Phase  II  as  a  surrogate. 

•  The  Harvard  study  variance  estimates  were  multiplied  by  the 
number  of  physicians  surveyed  for  each  service. 

•  It  was  unclear  which  values  were  used  to  center  services  linked  by 
total  work  instead  of  intra-service  work. 

•  The  biweight  procedure  acted  as  a  filter  to  eliminate  some  linkages 
from  the  regression  estimation.  Although  this  was  statistically  ap- 
propriate, it  may  have  implications  that  have  not  yet  been  investi- 
gated. 


Our  replication  attempt  was  largely  successful.  In  general,  our  re- 
sults are  within  10  percent  of  the  Harvard  results  except  for  the  three 
specialties  dermatology  (Phase  II  survey),  ophthalmology  (Phase  I 
survey),  and  orthopedic  surgery  (Phase  II  survey). 

A  New  Look  at  Linkage 

Our  experience  with  the  Harvard  linkage  procedures  led  us  to  con- 
sider an  alternative  to  their  methodology.  This  alternative,  which  we 
term  the  perturbation  minimization  procedure,  is  based  on  four  con- 
siderations that  differ  somewhat  from  those  employed  in  the  Harvard 
study: 

•  If  the  links  are  among  services  with  equivalent  amounts  of  work, 
then  the  RBRVS  scale  should  reflect  this  equivalence. 

•  Links  should  be  transitive. 

•  Because a CPT code represents the same work across specialties on
average,  it  constitutes  an  implicit  link. 

•  Adjusting  work  values  through  linkage  should  preserve  as  much  as 
possible  the  originally  surveyed  work  relationships  among  services. 

The  perturbation  minimization  procedure  takes  place  in  two  discrete 
steps:  redefinition  of  links  and  readjustment  of  scale  values  after 
linkage. 

Redefinition  of  Links.  We  began  with  our  link  set  equal  to  the 
Harvard  link  set.  We  then  changed  this  link  set  in  three  ways.  First, 
we  dropped  the  Harvard  intensity  links  because  these  links,  which 
implicitly  defined  work  as  the  product  of  time  and  intensity,  seem 
contradictory.  This  resulted  in  the  loss  of  32  of  the  275  original 
Harvard  links  that  were  intensity  links.  Second,  we  expanded  our 
link set so that all services with the same CPT code were linked. For
any  specialty  in  which  a  CPT  code  was  surveyed  more  than  once,  we 
formed  a  new  "service"  whose  work  value  was  the  average  of  the  work 
values  of  the  services  with  that  CPT  code.  This  averaging  produced 
83  new  services.  We  then  linked  all  common  CPT  codes  across  spe- 
cialties using  these  averaged  services  when  applicable  so  that  a  spe- 
cialty had  no  links  within  its  own  scale.  Common  CPT  code  links 
were  not  formed  for  EM  CPT  codes  90000  through  90699  because  of 
inherent  problems  with  these  codes.  Third,  we  expanded  linkages  to 
create  transitive  subsets  of  interlinked  services,  which  we  termed 
orbits.  We  created  208  orbits,  resulting  in  a  new  total  of  638  links. 
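
The expansion into transitive subsets can be sketched as a connected-components computation. The Python below is an illustration under assumptions, not the procedure used in this project: it treats each orbit as the set of services reachable from one another through links, and it counts every pair of services within an orbit as a link. The service labels are hypothetical.

```python
from collections import defaultdict


def find_orbits(links):
    """Group services into orbits: sets connected by chains of links."""
    parent = {}

    def find(service):
        # Union-find lookup with path halving.
        parent.setdefault(service, service)
        while parent[service] != service:
            parent[service] = parent[parent[service]]
            service = parent[service]
        return service

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for service in list(parent):
        groups[find(service)].add(service)
    return list(groups.values())


def count_pairwise_links(orbits):
    """Links implied if every pair of services in an orbit is linked."""
    return sum(len(orbit) * (len(orbit) - 1) // 2 for orbit in orbits)


# Hypothetical links: a-b and b-c imply a-c once links are made transitive.
orbits = find_orbits([("a", "b"), ("b", "c"), ("d", "e")])
implied_links = count_pairwise_links(orbits)
```

Making links transitive can add links that were never declared directly, which is one way the size of the link set can grow when it is expanded into orbits.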


The  results  of  recalculating  the  least-squares  linkage  procedures  with 
the  expanded  link  set  produced  major  differences  from  the  original 
Harvard  set.  Given  that  our  set  of  assumptions  is  as  reasonable  as 
that  used  by  Harvard,  these  results  indicate  that  the  linkage  proce- 
dure is  quite  sensitive  to  underlying  assumptions,  and  its  validity  is 
consequently  questionable. 

Readjusting  Scale  Values.  Our  proposed  revision  of  linkage  re- 
quires that  all  linked  services  within  an  orbit  have  the  same  work 
value  on  the  common  scale.  At  the  same  time,  we  would  like  the  op- 
timization to  ensure  that  the  distances  between  services  within  a  spe- 
cialty stay  as  close  as  possible  to  the  original  surveyed  distances.  In 
essence,  after  redefining  the  linked  service  values,  we  seek  to  adjust 
the  original  values  to  preserve  as  much  as  possible  their  relationships 
to  the  linked  and  unlinked  services  in  their  specialty.  We  developed  a 
least-squares  procedure  to  do  this. 
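
To give a feel for this kind of adjustment, the Python sketch below works through a deliberately simplified version for a single specialty. It is not the procedure developed in this project, whose details are given in Section 4. The assumptions are: linked services are fixed at their orbit values, distances are measured as differences between work values, and the objective is the sum over all pairs of services of the squared change in those differences. Under these assumptions the minimizing solution shifts every unlinked service by the mean shift of the linked services. All names and values are hypothetical.

```python
def readjust_specialty(original, linked_targets):
    """Adjust one specialty's values after its linked services are fixed.

    `original` maps each service to its surveyed work value.
    `linked_targets` maps each linked service to its required orbit value
    and must contain at least one service.
    """
    shifts = [linked_targets[s] - original[s] for s in linked_targets]
    # Closed-form minimizer of the pairwise squared-difference objective:
    # unlinked services all move by the average shift of the linked ones.
    mean_shift = sum(shifts) / len(shifts)
    adjusted = {}
    for service, value in original.items():
        if service in linked_targets:
            adjusted[service] = linked_targets[service]
        else:
            adjusted[service] = value + mean_shift
    return adjusted


# Hypothetical specialty with two linked services and one unlinked service.
adjusted = readjust_specialty(
    original={"a": 100.0, "b": 200.0, "c": 300.0},
    linked_targets={"a": 110.0, "b": 230.0},
)
```

A different definition of distance, for example one based on ratios rather than differences, would lead to a different adjustment.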

In  general,  though  the  percentage  changes  are  smaller  than  those  re- 
sulting from  the  changes  in  definition  of  linkage,  there  are  still  major 
differences  between  Harvard's  and  the  present  results.  One-fifth  of 
all CPT codes had RVWs that differed from the Harvard values by 15
percent  or  more.  As  before,  we  do  not  claim  that  our  results  should 
replace  the  earlier  ones,  but  only  that  the  linkage  process  is  sensitive 
to  methods  and  therefore  should  be  approached  with  great  caution. 

RECOMMENDATIONS 

Our  objective  in  this  report  is  not  to  tell  HCFA  how  the  Harvard 
study  should  have  been  done;  nor  is  it  to  relate  what  was  done  cor- 
rectly and  incorrectly  in  the  Harvard  study.  Instead,  our  goal  is  to 
specify  how  HCFA  best  can  modify  the  RBRVS.  Such  modifications 
include  both  short-term  fixes  to  take  care  of  egregious  errors,  known 
biases,  and  omissions  from  the  original  set  of  RVUs  and  long-term 
fixes  to  ensure  that  the  MFS  will  be  a  living  policy,  sensitive  to 
changes  in  medical  technology  and  the  economics  of  health  care. 

We  recommend  that  any  future  magnitude  estimation  of  work  values 
be  done  using  a  method  that  permits  individual  raters  to  interact 
with  each  other  as  they  estimate  RVWs.  The  two  leading  candidates 
for  such  a  method  are  (1)  a  two-round  mail  survey  with  feedback  on 
the  distribution  of  responses  between  rounds  (i.e.,  a  Delphi  process) 
and  (2)  a  discussion  panel  preceded  by  a  preliminary  mail  round.  The 
two  methods  appear  equally  valid  and  do  not  differ  greatly  in  cost. 
The  benefits  obtained  in  representativeness  from  a  larger  respondent 
sample  favor  use  of  the  mail  survey  for  long-term  changes  to  the 
RBRVS. But for short-term demands, the efficiencies in time and effort
from assembling a discussion panel make it the better option.

We  recommend  that  the  new  HCFA  universal  provider  file  be  used  to 
select  physicians  with  the  necessary  experience  in  the  target  services 
for  surveys  or  panels.  Representatives  from  specialty  societies  can 
observe  and  advise  panels. 

Our  examination  of  the  Harvard  study's  linkage  procedures  has  un- 
earthed a  number  of  technical  problems  and  conceptual  ambiguities. 
We  devised  an  alternative  linkage  procedure  based  on  a  modified  set 
of  assumptions  and  showed  that  this  procedure  led  to  major  differ- 
ences in  work  values.  The  conclusion  from  our  analyses  is  not  that 
our  alternative  is  better  than  the  original  linkage  procedure,  but 
rather  that  the  links — and  therefore  any  work  values  that  are 
touched  by  the  links — are  very  sensitive  to  changes  in  the  assump- 
tions underlying  the  procedure.  If  linkage  procedures  are  to  be  used 
in  the  future,  a  good  deal  of  further  research  is  required  to  ensure 
their  validity. 

However,  we  do  not  believe  that  linkage  need  be  used  for  revising  the 
RBRVS.  Instead,  if  a  core  reference  set  of  RVWs  whose  validity  is  not 
questioned  can  be  found,  this  reference  set  can  be  used  to  define  a 
common  scale  for  any  future  estimations  of  work  values,  thereby  ob- 
viating the  need  to  transform  an  idiosyncratically  scaled  set  of  values 
to  a  common  scale.  Because  of  the  importance  of  any  such  reference 
set,  its  validity  must  be  firmly  established  by  empirically  replicable 
means. 
1.  INTRODUCTION


BACKGROUND 

The  Medicare  Fee  Schedule  (MFS),  which  took  effect  as  mandated  by 
Congress  in  January  1992,  replaces  the  system  of  customary,  prevail- 
ing, and  reasonable  charges  used  by  the  Health  Care  Financing 
Administration (HCFA) for physician payment with a system of payment
based on relative value units (RVUs). The RVU for a physician's
service comprises three elements: (1) the relative value for
physician  work  (RVW);  (2)  practice  expenses,  i.e.,  overhead,  excluding 
malpractice  expenses;  and  (3)  malpractice  expenses.  Under  the  MFS, 
payment  for  a  service  involves  adjusting  each  of  these  three  elements 
of  the  RVU  by  a  separate  geographical  factor,  then  summing  to 
produce  a  value  for  the  service.  This  RVU  is  then  multiplied  by  a 
national  conversion  factor  to  yield  a  dollar  amount  for  payment. 

The  MFS  has  potential  major  consequences  for  both  the  total  amount 
and  the  distribution  of  Medicare  payments  for  physician  services. 
Because  of  the  large  magnitude  of  expected  payment  redistributions 
and  the  political  significance  of  reforming  a  major  public  program 
such  as  Medicare,  the  proposed  MFS  has  attracted  and  will  continue 
to  attract  close  scrutiny  and  criticism. 

The  major  innovation  of  the  MFS — and  the  focus  of  this  report — is  the 
measurement  of  RVWs.  Estimates  of  physician  work  under  the  new 
fee  schedule  are  based  on  the  Resource-Based  Relative  Value  Scale 
(RBRVS),  which  is  the  product  of  a  major  research  effort  conducted 
since  1986  by  a  team  of  researchers  at  the  Harvard  School  of  Public 
Health,  headed  by  William  C.  Hsiao,  Ph.D.  The  "Harvard  study"  has 
been  conducted  in  three  separate  phases  through  a  series  of  coopera- 
tive agreements  funded  by  the  Health  Care  Financing  Administration 
(HCFA).  The  Physician  Payment  Review  Commission  (PPRC)  has 
also  played  a  central  role  in  reviewing  the  results  of  the  Harvard 
study  as  it  has  evolved,  as  well  as  in  making  independent  recommen- 
dations concerning  improvements  and  refinements  to  the  MFS. 
Moreover, as the RBRVS has evolved, it has been extensively commented
upon and criticized, and numerous revisions have been recommended.
However, the estimates of physician work contained in the MFS rules^1
issued to take effect in January 1992 are based primarily on the
findings of Phase II and part of Phase III of the Harvard study.^2

^1 Health Care Financing Administration (1991b).

OUTLINE  OF  THIS  REPORT 

This  project  examines  selected  issues  regarding  the  measurement  and 
calculation  of  RVWs  that  resulted  from  the  RBRVS.  We  have  exam- 
ined alternative  methods  for:  (1)  obtaining  survey  data  on  the 
amount  of  work  performed  for  services  from  physicians  in  different 
specialties,  and  (2)  transforming  those  data  into  a  common  scale  of 
relative  work  values.  The  work  reported  here  was  conducted  in  1991, 
before  any  release  of  results  of  Phase  III  of  the  Harvard  study  or  the 
HCFA  final  regulations. 

Our  objective  is  not  to  tell  HCFA  how  the  Harvard  study  should  have 
been  done;  nor  is  it  to  relate  what  was  done  correctly  and  incorrectly 
in  the  Harvard  study.  The  political  reality  is  that  the  initial  RVWs 
provided by the Harvard study are, like it or not, here to stay. The
next  step  is  to  specify,  learning  from  the  lessons  of  the  Harvard  study 
and  associated  efforts,  how  HCFA  best  can  modify  the  RBRVS.  Such 
modifications  include  both  short-term  fixes  to  take  care  of  egregious 
errors,  known  biases,  and  omissions  from  the  original  set  of  RVUs 
and  long-term  fixes  to  ensure  that  the  MFS  will  be  a  living  policy, 
sensitive  to  changes  in  medical  technology  and  the  economics  of 
health  care. 

Section  2  of  this  report  presents  a  detailed  discussion  of  the  develop- 
ment of  the  RBRVS.  It  examines  the  steps  employed  in  creating  the 
RBRVS,  identifies  limitations  in  the  methods  and  assumptions  re- 
lated to  each  step,  and  notes  the  major  criticisms  of  the  process  that 
have  been  published. 

The  next  two  sections  examine  specific  issues  in  constructing  the 
RBRVS  and  present  alternative  methods  that  could  be  used  in  future 
revisions of the MFS. Section 3 compares individual- and group-based
survey methods for obtaining the magnitude estimation data that provide
RVWs. It examines questions concerning who should provide ratings, the
method for obtaining measurements, and the costs of obtaining data
using alternative methods.

^2 In this report, we refer to the complete body of work related to the
development of the RBRVS as the "Harvard study," specifying the phase
when appropriate. Phase I of the study was summarized in a series of
articles in the October 28, 1988, issue of the Journal of the American
Medical Association (Becker, Dunn, and Hsiao, 1988; Braun et al., 1988a;
Braun et al., 1988b; Dunn et al., 1988; Hsiao et al., 1988a; Hsiao et
al., 1988b; Hsiao et al., 1988c; Hsiao et al., 1988d; Kelly et al.,
1988). The final report of Phase II of the Harvard study is Hsiao et al.
(1990); other journal articles are forthcoming. Results from Phase III
were presented to HCFA throughout 1991 and 1992. The final report of
this phase was scheduled for release in December 1991 but has not yet
been made available as of this writing (April 1992).

Section  4  looks  at  the  issue  of  merging  surveys  from  different  special- 
ties and  possibly  over  different  time  periods  into  a  common  scale  of 
measurement.  The  least  possible  distortion  of  the  magnitude  rela- 
tionships among  services  is  desirable  when  transforming  from  a  spe- 
cialty-specific to  a  common  scale.  In  this  section,  we  examine  the 
methods  used  to  derive  the  common  scale,  propose  alternative  meth- 
ods, and  compare  the  results  of  the  different  methods. 

Finally, Section 5 briefly recapitulates the findings of the previous
two sections and recommends how to revise RVWs for both the short
term and the long term.


DEVELOPMENT  OF  THE  RESOURCE-BASED 
RELATIVE  VALUE  SCALE 


OVERVIEW 

This section describes how estimates of RVWs were developed for use
in the MFS and examines several important methodological issues
and assumptions related to the process for developing these estimates.^1
Because the results of the last two phases of the Harvard
study have not been widely disseminated and evaluated, one important
objective of this section is to examine critically the methods and
assumptions employed to produce the final RBRVS that is effective
for calendar year 1992.^2

The development of the RBRVS involved a five-step process: (1) obtaining
raw survey data on physician work separately for each
"major" specialty, (2) fitting data from each specialty onto a common
relative value scale, (3) calculating total work based on estimates of
pre- and post-service work, (4) mapping work values for surveyed services
into codes used for payments, and (5) extrapolating work values
from surveyed services to nonsurveyed services. Here, we will look
separately at each step.

STEP  1:  OBTAIN  SPECIALTY-SPECIFIC  WORK  VALUES 

The  first  step  of  the  RBRVS  was  to  obtain  RVWs  for  different  services 
performed  by  physicians.  From  the  very  beginning,  the  consensus 
has  been  that  the  best  way  to  do  this  is  to  ask  physicians;  the  issue 
has  been  how  to  ask  the  question.  The  Harvard  study  adopted  the 
principle that the basic piece of physician effort was "intra-service"
work,  or  the  work  involved  in  the  main  part  of  delivery  of  a  service. 
Although  intra-service  work  as  a  term  of  art  has  evolved  slightly  over 
time,  basically  it  means: 

•  For  office-based  evaluation  and  management  (EM)  services:    the 
face-to-face  encounter  time; 

•  For  hospital  visits:  the  time  spent  on  the  floor; 


^1 A detailed table summarizing the tasks and research methods of all three phases of
the Harvard study, as well as other studies that have had an influence on the
development of the RBRVS, is presented in Appendix A.

^2 Health Care Financing Administration (1991b).


•  For  surgical  procedures:  the  skin-to-skin  time;  and 

•  For  laboratory  and  imaging  services:  the  entire  task. 

Magnitude  Estimation 

The Harvard study employed "magnitude estimation" as the primary
methodology for obtaining physician work values. Magnitude estimation
is a well-established psychometric technique^3 that has been successfully
employed to assess subjective values in many different domains.
Respondents are given a reference value or values that define
a unit of measurement; for the Harvard study, this was the so-called
"standard" service (e.g., for general surgeons, an uncomplicated indirect
inguinal hernia repair on a 45-year-old male), which was defined
as having a work value of 100. Then, the respondents rate every
other service in the survey relative to the standard or reference value.
For example, if a particular service requires half the work of the
standard, it should be rated 50; if it requires three times the work of
the standard, it should be rated 300.^4
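
The rating convention just described amounts to simple arithmetic. The following minimal Python sketch assumes only what is stated above, namely that the standard service is fixed at 100. The function names are hypothetical and are not part of the Harvard survey instrument.

```python
STANDARD_WORK = 100.0  # value assigned by definition to the standard service


def rating_for(work_ratio, standard=STANDARD_WORK):
    """Rating a respondent should give a service that requires
    `work_ratio` times the work of the standard service."""
    return work_ratio * standard


def implied_ratio(rating, standard=STANDARD_WORK):
    """Work of a rated service expressed as a multiple of the standard."""
    return rating / standard


if __name__ == "__main__":
    print(rating_for(0.5))     # a service with half the work of the standard
    print(rating_for(3))       # a service with three times the work
    print(implied_ratio(250))  # multiple of the standard implied by a rating of 250
```

Because every rating is a ratio to the same reference, only the ratios among ratings carry information; the value 100 is an arbitrary unit.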

Direct magnitude estimation is not the only possible way to obtain
RVWs. One could instead decide that all work was of equal effort and
simply base work on the time spent.^5 Alternatively, one could decompose
work into separate parts, somehow estimate the parts separately,
and estimate the correct recomposition function to calculate
the RVW as a whole. Or, one could employ some form of multiple
comparisons technique to first rank-order and then place on a cardinal
scale the set of services under scrutiny.^6

But, considering all of these alternatives, the analyses of Phase I of
the Harvard study^7 concerning the validity of magnitude estimation
appear convincing. The RVWs derived from magnitude estimation
reflect a linear combination of four separate components of work:
time, cognitive effort, physical effort, and stress. Furthermore, the
weights attributable to these components of work vary across different
medical specialties. This result is entirely reasonable, because
different specialties are not likely to have the same "mix" of work
components. This finding also means that direct magnitude estimation
of work is a more efficient way to obtain values than measuring
the separate components of work and their appropriate mix for each
specialty.


^3 See, for example, Stevens (1957, 1966); Stevens and Galanter (1957).

^4 In a technical sense, this method should be called ratio estimation rather than
magnitude estimation (Kahan, 1968).

^5 For example, Maloney (1991).

^6 For example, Bock and Jones (1968).

^7 Hsiao et al. (1988d).

Despite some criticism of magnitude estimation,^8 we found no inherent
flaw or limitation in the use of magnitude estimation to obtain
subjective judgments from physicians about RVWs, and we concur
with the PPRC and other recent evaluations^9 supporting the
appropriateness of magnitude estimation for this purpose.

We note a major inconsistency in the use of magnitude estimation.
One argument for accepting magnitude estimation as a method of
estimating work is that the problem of specifying the way the components
of work combine need not be addressed. Given this (in our
view) strong argument for magnitude estimation, we find it perplexing
that the Harvard study chooses, when convenient, to assume that
work is the product of time and "intensity," where intensity is implicitly
defined as everything else involved in work except time. No evidence
was presented in the Harvard study or by anybody else to
justify the hypothesis that work equals time multiplied by intensity;
this lack of evidence calls into question any analyses that assume this
hypothesis to be correct. This issue will appear a number of times in
the development of the RBRVS.

Specialty-Specific  Telephone  Surveys 

In both Phases I and II, survey data were obtained from physicians
through specialty-specific telephone surveys (18 specialties in Phase
I; 15 in Phase II). A nationally representative sample of about 185
physicians was identified in each specialty and contacted; approximately
100 physicians per specialty participated in the surveys.
Furthermore, as part of Phase II, seven specialties included in Phase
I were resurveyed either because they constituted a substantial portion
of services paid for under Part B of Medicare or because of the
need for a broader representation of subspecialties or services. In total,
the telephone surveys produced 40 separate surveys that were
used as input data for developing the RBRVS.

Phase III was supposed to use a so-called small-group process (some
combination of mail surveys and face-to-face meetings) instead of
telephone surveys to obtain work estimates. Following the first two
phases, the RVWs obtained through telephone surveys are referred to
by the Harvard study group as "gold standards," although no formal
evaluation of alternative methods has been conducted by Harvard or
others to support this assumption. Section 3 of this report examines
the relative merits of alternative survey methods for obtaining RVW
estimates, including the costs of different survey techniques.


^8 See, for example, Pasnak (n.d.).

^9 Physician Payment Review Commission (1991), p. 23.

A statistical algorithm was used to replace values missing because of
nonresponse and values excluded as outliers. For particular services,
surveyed physicians might not provide an estimate of work, for example,
if they did not feel qualified to respond. Given these missing values,
the expectation-maximization (EM) algorithm^10 provides a better
estimate of a service's mean work value and associated standard
deviation across physicians than ignoring the missing values or
replacing them via some ad hoc procedure.

The EM algorithm consists of an expectation step and a maximization
step, which are alternated until convergence. To begin, initial
estimates for the parameters are calculated based on the existing
data. In the expectation step, the conditional expectation of the
missing data's contribution to the likelihood function is calculated
given the current parameter estimates. In the maximization step, the
maximum likelihood estimates of the parameters are calculated just as
is usually done when no data are missing. These two steps are iterated
until the parameter estimates converge. Although the use of such an
algorithm seems appropriate, we were unable to evaluate how it was
employed or its effect on final RVWs because we did not have the
necessary individual physician-level data.
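
The alternation of the two steps can be illustrated with a small example. The Python sketch below is not the Harvard implementation, whose details were not available to us. It assumes a deliberately simple setting: two services whose ratings are jointly normal, with the first service rated by every physician and the second rated by only some (missing ratings are given as `None`). The function name `em_mean_sd` and the sample ratings are hypothetical.

```python
from math import sqrt


def em_mean_sd(x, y, tol=1e-10, max_iter=10_000):
    """EM estimates of the mean and SD of y under a bivariate normal model.

    x: ratings of a service rated by every physician (complete).
    y: ratings of a second service; None marks a missing rating.
    Returns (mean_y, sd_y) as maximum likelihood estimates.
    """
    n = len(x)
    # x is complete, so its ML mean and variance never change.
    mu_x = sum(x) / n
    s_xx = sum((v - mu_x) ** 2 for v in x) / n
    # Initial estimates for y come from the observed ratings only.
    observed = [b for b in y if b is not None]
    mu_y = sum(observed) / len(observed)
    s_yy = sum((b - mu_y) ** 2 for b in observed) / len(observed)
    s_xy = 0.0
    for _ in range(max_iter):
        # Expectation step: expected y and y^2 for each physician,
        # conditional on x and the current parameter estimates.
        slope = s_xy / s_xx
        resid_var = s_yy - slope * s_xy
        e_y, e_yy = [], []
        for a, b in zip(x, y):
            if b is None:
                m = mu_y + slope * (a - mu_x)
                e_y.append(m)
                e_yy.append(m * m + resid_var)
            else:
                e_y.append(b)
                e_yy.append(b * b)
        # Maximization step: the usual complete-data ML formulas,
        # applied to the expected sufficient statistics.
        new_mu = sum(e_y) / n
        new_syy = sum(e_yy) / n - new_mu ** 2
        new_sxy = sum(a * m for a, m in zip(x, e_y)) / n - mu_x * new_mu
        change = abs(new_mu - mu_y) + abs(new_syy - s_yy) + abs(new_sxy - s_xy)
        mu_y, s_yy, s_xy = new_mu, new_syy, new_sxy
        if change < tol:
            break
    return mu_y, sqrt(s_yy)


if __name__ == "__main__":
    service_a = [100.0, 120.0, 80.0, 150.0, 90.0, 110.0]  # hypothetical ratings
    service_b = [50.0, 65.0, 35.0, None, 40.0, None]      # two nonresponses
    print(em_mean_sd(service_a, service_b))
```

In this sketch the physicians who skipped the second service rated the first service higher than average, so the EM estimate of the mean lies above the simple mean of the observed ratings. Ignoring the missing values would miss that adjustment.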


Physician Services Defined Using "Vignettes"

Each specialty was asked to provide estimates of intra-service work
for about 23 services, presented as "vignettes." Each vignette was a
brief description of a patient and the service provided. Although
vignettes were defined independently of codes used for billing purposes,
the ultimate goal was to match them to HCFA Common Procedure
Coding System (HCPCS) billing codes.^11


^10 See, for example, Dempster, Laird, and Rubin (1977); Little and Rubin (1987).

^11 The HCPCS is used for payment of physician services under Part B of Medicare.
HCPCS is primarily based on the American Medical Association's (1991) Current
Procedural Terminology (CPT) coding system, with some additional codes defined by
HCFA and its fiscal intermediaries.


Criticisms  of  Data-Collection  Procedures 

A number of concerns have been raised about the use of these vignettes
in Phase I of the Harvard study.^12 Phases II and III of the
study were designed to respond to these criticisms and achieved varying
degrees of success. Here, we discuss several of the most salient of
these criticisms.

Who Performed the Ratings. The Phase I surveys included all
physician responses, regardless of the physician's experience in
performing surveyed services (i.e., "fitness to rate"). This was a concern
because physicians who perform a service infrequently might produce
biased estimates of work.

This concern was directly addressed in a study conducted by Abt
Associates for the Society of Thoracic Surgeons.^13 This study, which
examined services performed by thoracic surgeons, differed from the
Harvard study in that the overall specialty was divided into three
subspecialties, each of which was independently surveyed. Respondents
thus were required to have personal experience with the
services they rated. The Abt study concluded that its ratings had
greater validity than the Phase I Harvard study ratings of the same
specialty.

The Harvard group addressed this potential bias in Phase II by conducting
regression analyses using physician characteristics, including
frequency of performing a service, to predict physician-specific deviations
from the median work rating for each service. The results of
this analysis indicated that the physician's frequency of performing a
service had no significant effect in explaining these deviations. PPRC
reports, however, that its own analyses led to the conclusion that certain
services would have had substantially different work values if
the responses of physicians who performed the service infrequently
were excluded from the surveys.

Individual Differences in Work Perception. A different issue regarding
the individuals providing the ratings is that the perceptions
of individual physicians might differ, either randomly or systematically,
regarding the work involved in the standard service. That is,
different people might have different ideas of how much work is involved
in "100 units of work." The Phase I surveys assumed that all
physicians within a specialty viewed the standard service as the
same, absolute level of work.


^12 The majority of these criticisms are summarized in Physician Payment Review
Commission (1989, pp. 37-39) and Physician Payment Review Commission (1991, pp.
24-26).

^13 Noether et al. (1990).

One major refinement in Phase II is that physician estimates of relative
work were adjusted to account for different "perceptions" of the
standard service. This adjustment process required, instead of the
assumption of a common perception of work for the standard service,
the assumption that all physicians share a common mean work
value across the surveyed vignettes. This latter assumption also
permitted standard errors to be estimated for the standard service,
and the Phase II final report stated that these standard errors
"provide a more valid estimate of the standard deviation for each service,
including the standard service."
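
To make the common-mean assumption concrete, the following Python sketch shows one way such an adjustment could work. It is an illustration only: the Phase II report's actual procedure is not reproduced here, and the choice of a multiplicative rescaling to the arithmetic mean, the function name, and the sample ratings are all assumptions.

```python
def rescale_to_common_mean(ratings_by_physician):
    """Rescale each physician's ratings by a single multiplicative
    constant so that every physician has the same mean rating.

    ratings_by_physician: dict mapping a physician label to a list of
    ratings for the same vignettes in the same order, with the standard
    service (rated 100 by definition) in position 0.
    """
    means = {p: sum(r) / len(r) for p, r in ratings_by_physician.items()}
    grand_mean = sum(means.values()) / len(means)
    return {
        p: [value * grand_mean / means[p] for value in r]
        for p, r in ratings_by_physician.items()
    }


if __name__ == "__main__":
    raw = {
        "physician_a": [100.0, 50.0, 200.0],  # hypothetical ratings
        "physician_b": [100.0, 80.0, 300.0],
    }
    for physician, values in rescale_to_common_mean(raw).items():
        print(physician, [round(v, 1) for v in values])
```

Within-physician ratios are unchanged by the rescaling, but the standard service no longer has the same value for every physician. That variation is what allows a standard error to be computed for the standard service, and it exists only because of the common-mean assumption.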

To regard the resulting difference between Phase I and Phase II as an
improvement assumes that physicians shared a common average
value of work across the 22 to 25 services surveyed. We have no a
priori reason to believe that this assumption is valid. Moreover, the
assumption of a common assessment of standard services is not really
necessary for magnitude estimation, which is concerned with the
differences between the standard and measured services.

What  remains  is  to  better  address  the  problem  of  anchoring  standard 
services  so  that  we  may  safely  assume  that  different  raters  use  a 
common  scale.  To  date,  neither  the  Harvard  study  nor  any  other 
published  comments  have  addressed  this  issue. 

Vignettes. A number of criticisms of Phase I stated that the vignettes
did not reflect the work physicians do. For example, a concern
was raised that the vignettes might not measure the work involved
in treating Medicare (i.e., largely elderly) patients. This issue
was addressed directly, but in a limited fashion, in Phase II. The
Phase II report presents evidence for six pairs of vignettes to suggest
that intra-service work and time do not vary substantially for
Medicare and non-Medicare patients receiving the same service. In
addition, more vignettes in Phase II were defined using age 65 and
above.

Also, the Abt study challenged the validity of the standard service
used for the measurement of thoracic surgery. It also questioned
whether the vignettes employed for that survey adequately reflected
the nature and variety of work performed by thoracic surgeons.
Again, differences between the Abt study and the Harvard study were
attributed by Abt in part to refinement of the standard and other
vignettes, and the Abt results were labeled more valid than the original
ones. In response to Abt, the Harvard group pointed out that most of
the differences between the two studies were related to the pre-service
and post-service work estimations, not to the core, intra-service
work estimations.


STEP  2:  LINK  SPECIALTY-SPECIFIC  WORK  VALUES  ON  A 
COMMON  SCALE 

After assessments of intra-service work had been separately estimated
for different specialties, a procedure was needed to compare
these assessments on a common scale. We might liken the problem to
one  of  different  sets  of  length  measurements  using  rulers  marked 
with  different  units  of  measurement.  Some  rulers  measure  in  inches, 
others  in  centimeters,  and  still  others  in  feet,  yards,  or  Angstrom 
units.  To  make  these  measurements  comparable,  the  values  on  each 
ruler  must  be  transformed  by  a  multiplicative  constant.  To  find  the 
set  of  multiplicative  constants  requires  finding  objects  measured  on 
the  different  rulers  that  are  known  to  have  the  same  length. 

"Same"  and  "Equivalent"  Links 

Two services from different specialties having the same amount of
intra-service work (and thus usable for comparing specialty-specific
measurements) were called "linked" in the Harvard study. In Phase
I, a multi-specialty panel of 24 physicians identified "same" (i.e.,
involving identical work) and "equivalent" (i.e., involving similar
amounts of work) services to serve as links. The process for identifying
links was an iterative one, involving both clinical judgment and
empirical evidence. The panel originally identified 159 pairs of services
as potential linkages but reduced the number to 75 after eliminating
pairs whose elements came from nonsurveyed specialties and
pairs whose elements differed by more than 25 percent in average
time. The final number was increased to 82 (40 same, 42 equivalent)
after a cluster analysis on time and work identified additional potential
links, seven of which were approved by the multi-specialty panel.

Additional links were developed in Phase II for specialties not included
in Phase I, as well as for five specialties from Phase I. These
links were developed by multi-specialty panels drawn from 26 specialties.
The concept of "same" and "equivalent" links from Phase I based
on intra-service work was expanded to include four types of links,
based on (1) intra-service work, (2) total work, (3) intensity of work, or
(4) intra-to-total work. In five multi-specialty panel meetings,
panelists identified 193 pairs of services from 362 potential links.
The total number of linkages, therefore, was 275 (82 from Phase I,
193 from Phase II).

The inclusion of links based on intensity has been criticized by PPRC
and others because it assumes that work is a simple product of time
and intensity.^14 This linear relationship between total work and time
is questionable. Furthermore, it is not clear from the Phase II final
report how intensity links were entered into the regression analysis.
Finally, the sensitivity of the final common work scale to changes in
links, or changes in the work values for links, has not been
evaluated.^15


Regression  Model  for  Linking  Services 

For both Phase I and Phase II, two specialties were typically connected
by more than one pair of linked services, and the multiplicative
transformations demanded by each pair were typically not the
same. This was deliberately done because of the assumption that
there was error of measurement in both the estimation of magnitude
and the assumption of equivalency. All of the linked services were
combined in a regression analysis to produce an estimated set of
transformations to move the specialty-specific measurements to a
common scale. This transformation ensured that the original relationships
among work values within each specialty survey were preserved.
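
The idea of estimating one multiplicative constant per specialty from several imperfect links can be sketched as a least-squares problem on the logarithms of the work values. The Python sketch below is an assumption-laden illustration, not the Harvard regression (which is discussed in Section 4): the log-linear form, the choice of a reference specialty whose constant is fixed at 1, the function names, and the sample links are all hypothetical.

```python
from math import exp, log


def solve(a, r):
    """Solve the linear system a c = r by Gaussian elimination with
    partial pivoting. Raises ZeroDivisionError if some specialty is not
    connected to the reference through any chain of links."""
    n = len(r)
    m = [row[:] + [r[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(col + 1, n):
            f = m[k][col] / m[col][col]
            for j in range(col, n + 1):
                m[k][j] -= f * m[col][j]
    c = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * c[j] for j in range(i + 1, n))
        c[i] = (m[i][n] - tail) / m[i][i]
    return c


def fit_scale_factors(links, reference):
    """Least-squares multiplicative constants, one per specialty.

    links: list of (specialty_a, value_a, specialty_b, value_b), each a
    pair of services judged to involve the same work. The model is
    factor_a * value_a = factor_b * value_b, fitted on the log scale.
    """
    specialties = sorted({s for a, _, b, _ in links for s in (a, b)} - {reference})
    index = {s: k for k, s in enumerate(specialties)}
    n = len(specialties)
    normal = [[0.0] * n for _ in range(n)]
    rhs = [0.0] * n
    for a, value_a, b, value_b in links:
        # One equation per link: c_a - c_b = log(value_b) - log(value_a).
        row = {}
        if a != reference:
            row[index[a]] = row.get(index[a], 0.0) + 1.0
        if b != reference:
            row[index[b]] = row.get(index[b], 0.0) - 1.0
        target = log(value_b) - log(value_a)
        for i, xi in row.items():
            rhs[i] += xi * target
            for j, xj in row.items():
                normal[i][j] += xi * xj
    c = solve(normal, rhs)
    factors = {s: exp(c[index[s]]) for s in specialties}
    factors[reference] = 1.0
    return factors


if __name__ == "__main__":
    links = [
        ("surgery", 100.0, "medicine", 50.0),  # demands a factor of 2
        ("surgery", 80.0, "medicine", 10.0),   # demands a factor of 8
    ]
    print(fit_scale_factors(links, reference="surgery"))
```

When two links between the same pair of specialties disagree, as in the usage example, the fitted constant is a compromise between them. Because each specialty's values are multiplied by a single constant, ratios among services within a specialty are unchanged.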

For Phase II, the regression analysis to produce a common work scale
employed input data from 40 specialty surveys representing 33 distinct
specialties (15 from Phase II, 3 from Phase I resurveyed in
Phase II, 4 resurveyed as special studies in Phase II, and 11 from
Phase I). For specialties surveyed in both Phases I and II, results
from each phase were treated as separate inputs into the regression.^16


STEP  3:  CALCULATE  TOTAL  WORK  USING  ESTIMATES  OF 
PRE-  AND  POST-SERVICE  WORK 

Recall that for each service included in the Phase I and II surveys,
physicians were asked to estimate their intra-service work and time.
However, RVWs are based on the total amount of work. Step 3 provided
a means of adding pre-service and post-service effort to the
intra-service work to estimate the total amount of work that is the
RVW for a service.


^14 See the earlier discussion regarding magnitude estimation.

^15 Each of these issues is addressed in detail in Section 4 of this report.

^16 This regression model and its limitations are discussed, along with a presentation
of an alternative approach, in greater detail in Section 4.

In Phase I, estimates of pre- and post-service times were obtained in
the original telephone surveys for 55 vignettes. Then, a follow-up
telephone survey was conducted among physicians in seven specialties
who participated in the original survey to obtain additional estimates
of pre- and post-service times. When combined, these surveys
produced data on pre- and post-service times for 153 different services.
Regression analysis was used to develop estimates of pre- and
post-service times for the remaining surveyed services for which only
intra-service work and time had been obtained. The estimates of pre-
and post-service times were multiplied by estimates of work intensity
(i.e., work per minute) to obtain final values of pre- and post-service
work. This process produced estimates of total work for all 372 distinct
services in Phase I.


Surgical  Services 

The establishment of total work from intra-service work was subject
to much criticism. PPRC^17 conducted a study to separate the surgical
global service into (1) pre-operation visits, (2) the operation (including
scrub work), and (3) post-operation visits. PPRC proposed using
intra-service work and scrub work from the Harvard study as the measure
of work for the operative component. Estimates of pre-operative and
post-operative visits were obtained from specialty societies, which used
either a consensus process or a committee process. PPRC regards
these values as more valid than the Harvard estimates.

The Abt study subdivided pre-service and post-service work into separate
parts and estimated the work value of each part, summing to
obtain  a  total  work  value.  Their  results  differed  in  major  ways  from 
the  Harvard  total  work  values,  both  in  terms  of  differences  in  the 
value  of  individual  services  and  in  terms  of  a  systematic  tendency  for 
the  Abt  values  to  be  of  more  extreme  magnitude  (i.e.,  larger  for  highly 
valued  services  and  smaller  for  less  valued  services).  Abt  views  the 
difference  between  their  results  and  the  Harvard  results  as  due  to  a 
"compression"  artifact  of  the  Harvard  method. 

In response to these criticisms, Harvard's Phase II developed refined
estimates of pre- and post-service work by defining the work involved
in those time periods more precisely. Pre- and post-service work was
first defined conceptually as eight components, including (1) initial
consultation, (2) hospital admission work-up, (3) pre-operative evaluation,
(4) other pre-operative work, (5) post-operative follow-up on day
of surgery, (6) follow-up visits in intensive care unit after day of
surgery, (7) follow-up visits in acute care unit after day of surgery,
and (8) post-hospital follow-up visits within 90 days of surgery.


^17 Physician Payment Review Commission (1990, 1991).

For  data-collection  purposes,  this  conceptual  model  was  collapsed  into 
three  components:  (1)  pre-operative,  (2)  same-day  post-operative,  and 
(3)  office  follow-up.  Data  on  work  and  time  for  these  components 
were  collected  for  selected  services  as  part  of  the  Phase  II  surveys,  as 
well as from specialty panels. A fixed value of 0, 15, or 25 minutes
was  assigned  for  other  pre-operative  work,  depending  on  procedure 
and  setting.  The  initial  consultation  and  hospital  admission  work-up 
were  excluded. 

In Phase III, the conceptual model of pre- and post-service work was
further refined into five components: (1) pre-surgical EM, (2) other
pre-surgical work, (3) post-operative follow-up on day of surgery, (4)
follow-up visits in hospital after day of surgery, and (5) follow-up visits
in office. In this phase, direct estimates of work and time were to
be obtained during the pre-operative and post-operative periods for
about 300 additional surgical procedures, including the number, duration,
and work values of visits before and after surgery. Estimates
based on the sum of the five components were to be compared to
direct estimates of total work and time for the entire global service.
Finally, direct estimates of the two major components of post-service
work (e.g., before and after hospital discharge) were to be obtained.

Regression  Models  for  Pre-  and  Post-Service  Work 

Regression models were developed in Phase II to estimate the three
components of pre- and post-service time defined above as a function
of (1) intra-service work, (2) intra-service time, (3) hospital median
length of stay, and (4) category of surgical service. Six models were
used: three for services primarily performed in inpatient settings
and three for services primarily performed in outpatient settings.^18
The predicted values of pre- and post-service times obtained from the
regression models were then multiplied by the work intensity values
for each component to produce a work value for each component of the
service.^19 The total work value was thus equal to the sum of the work
estimates for each component of pre- and post-service work.


^18 The models for pre-service and same-day post-service were physician-level
regressions, whereas the models for office follow-up were service-level regressions.

The input data used to develop these regressions were not available
for evaluation as part of this study. Obviously, these regressions
merit examination, especially because Phase II pre- and post-service
work values are substantially higher than those obtained by other
researchers.^20 Other issues that warrant further study include (1) the
assumption of constant work intensity across surgical services for
each component of pre- and post-service work, and (2) the value of
obtaining total work through a "bottom up" approach (i.e., component
analysis) compared with direct estimation of total work.
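
The "bottom up" calculation with constant intensities can be written out in a few lines. The Python sketch below uses the Phase II work intensity values reported in this section's notes (2.2 for pre-service, 3.0 for same-day post-service, 2.5 for office follow-up, and 0.8 for other pre-service work). Everything else is an assumption made for illustration: the function and key names are hypothetical, the times in the usage example are invented, and intra-service work is added to the component total following the description of Step 3.

```python
# Work intensity (work per minute) for each component, as reported for
# Phase II. The same value is applied to every surgical service, which is
# exactly the constant-intensity assumption questioned above.
INTENSITY = {
    "pre_service": 2.2,
    "same_day_post_service": 3.0,
    "office_follow_up": 2.5,
    "other_pre_service": 0.8,
}


def component_work(minutes):
    """Work for each component as time multiplied by intensity.

    minutes: dict mapping component name to time in minutes; components
    that are absent are treated as taking no time.
    """
    return {name: minutes.get(name, 0.0) * rate for name, rate in INTENSITY.items()}


def total_work(intra_service_work, minutes):
    """Intra-service work plus the work of all pre- and post-service
    components (an assumption based on the description of Step 3)."""
    return intra_service_work + sum(component_work(minutes).values())


if __name__ == "__main__":
    times = {  # hypothetical predicted times, in minutes
        "pre_service": 10,
        "same_day_post_service": 10,
        "office_follow_up": 20,
        "other_pre_service": 15,
    }
    print(component_work(times))
    print(total_work(100.0, times))
```

Because each intensity is a fixed constant, any error in a predicted time passes proportionally into the work value for that component.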

Proposed  Studies  of  Pre-Service  and  Post-Service  Work 

Because of uncertainty about the final global fee policy to be adopted
by HCFA, services were grouped into three categories: (1) invasive
procedures, which include all components of work; (2) endoscopic
procedures, which include only work performed on the day of the
procedure; and (3) minor procedures, which exclude pre- and post-service
work.

PPRC  has  proposed  to  validate  specialty  society  data  using  claims 
data  from  carriers  that  do  not  include  visits  in  the  global  fee,  data 
from  health  maintenance  organizations  and  multi-specialty  groups, 
and  physician  survey  data.  Also,  PPRC  intends  to  convene  a  panel  of 
physicians  "not  directly  affected  by  payment  reform"  to  assess  the 
face  validity  of  estimates  for  services  where  the  objective  data  for 
comparison are inadequate.

EM  Services 

Data from Phases I and II clearly indicated the inadequacy of CPT
codes for EM services (i.e., visits). Physician estimates of work for the
same service varied widely across specialties, suggesting that a single,
valid, reliable RVW could not be determined for most visit codes.

PPRC has conducted its own study of visit codes and developed detailed
recommendations about how visit codes should be modified.^21
The final authority for revising these codes, however, rests with the
CPT Editorial Board. The visit codes developed by the CPT Editorial
Board reflect some of the PPRC recommendations (e.g., to include
encounter time in the visit definition). The final visit codes included in
the MFS, however, involve many more distinctions and categories
than proposed by PPRC. The structure of the new visit codes is an
extremely important area for further research and evaluation but was
beyond the scope of this study.


^19 The work intensity values for each component were (1) 2.2 for pre-service, (2) 3.0
for same-day post-service, and (3) 2.5 for office follow-up. Other pre-service work was
assigned an intensity value of 0.8.

^20 Physician Payment Review Commission (1991), p. 42.

^21 Lasker, Marquis, and Morrow (1991).


STEP  4:  MAP  WORK  VALUES  FOR  VIGNETTES  INTO 
HCPCS  CODES 

After linkage and determination of total work values for all of the
vignettes, the work values must be assigned to HCPCS billing codes.
Where there was a one-to-one mapping between a single vignette and
a single HCPCS code, this was a straightforward task. However, the
assignment was not always straightforward because (1) the translation
from vignette to the appropriate code was not subject to an unambiguous
set of rules, (2) some vignettes from the same specialty
were assigned to the same HCPCS code, (3) some vignettes from different
specialties were assigned to the same HCPCS code, and (4)
most HCPCS codes did not have a vignette assigned to them. The
last problem is addressed in Step 5, below; the remainder are addressed
here.

Translating  Vignettes  to  HCPCS  Codes 

A multi-specialty team of technical consultants translated vignettes
into HCPCS codes. This process was not straightforward, for the
following reasons. Many panelists were not thoroughly
familiar with the CPT codes that form the basis of the HCPCS coding
system. Therefore, a separate panel of consultants from the AMA
staff responsible for the development and maintenance of CPT codes
resolved disagreements in assignments by the original panelists.
Another translation problem was that some surgical services require
multiple codes to be described accurately. In these cases, the work
values obtained for the vignette were allocated across more than one
CPT code. Finally, codes for EM services were found to have vastly
different work values across specialties. Work values for EM services,
therefore, were not combined across specialties.
Mapping Vignettes with Common HCPCS Codes

Under the MFS, each HCPCS code can have one, and only
one, RVW. Therefore, it was important to find a way to combine work
values for the same HCPCS code. The Harvard study used different
methods  for  dealing  with  vignettes  mapped  to  the  same  CPT  codes 
depending  on  whether  the  common  codes  were  found  within  or  across 
specialties. 

For services from a common specialty that shared an HCPCS code,
the RVW was calculated as the arithmetic mean of the vignette work
values. Although one may argue that the underlying assumption
(that each of the vignette variations on the HCPCS code occurs
with equal frequency) is untenable, no way exists to estimate the true
prevalence of the variations. The mean, therefore, seems, faute
de mieux, to be the right answer. For specialties having many widely
differing vignettes sharing a common HCPCS code,22 the solution to
any inequity lies in revising the code, not the RVWs.

For services from different specialties that shared an HCPCS code
(typically also paired as a cross-specialty link), the common RVW was
calculated as the volume-weighted average (using Medicare Part B
data) of work values from the realigned specialty-specific scales. For
example, if specialty X and specialty Y billed to the same code and
specialty X accounted for two-thirds of the billings, then the RVW was
two-thirds times the specialty X work value plus one-third times the
specialty Y work value. This averaging of work values has the effect
of assigning RVWs for vignettes that share CPT codes that are
inconsistent with the original ratios of work within the specialty. That
is, continuing our example, specialty X physicians might get less for a
linked service than the surveyed respondents believe appropriate
because the same service is also provided by another specialty but at a
lower estimated work value. In Section 4, we present an alternative
definition of linkages and a way to calculate RVWs on a common scale
that minimizes this type of departure from the original ratios.
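
The two combination rules described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
the function names, the specialty labels, and the work values and
billing volumes in the example are hypothetical and are not taken from
the Harvard study or from Medicare Part B data.

```python
def within_specialty_rvw(vignette_values):
    """Arithmetic mean of the work values of vignettes sharing one code."""
    return sum(vignette_values) / len(vignette_values)


def cross_specialty_rvw(work_values, billing_volumes):
    """Volume-weighted average of specialty-specific work values.

    Both arguments are dicts keyed by a (hypothetical) specialty label.
    """
    total_volume = sum(billing_volumes[s] for s in work_values)
    return sum(
        work_values[s] * billing_volumes[s] / total_volume for s in work_values
    )


# Hypothetical example: specialty X supplies two-thirds of the billings,
# so the common RVW is two-thirds of X's value plus one-third of Y's.
common_rvw = cross_specialty_rvw({"X": 120.0, "Y": 90.0}, {"X": 2000, "Y": 1000})
print(common_rvw)
```

In this made-up case the common value falls below specialty X's own
estimate, which is the departure from within-specialty ratios noted in
the text.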

Evaluation  and  Management  Services 

One  type  of  billing  code  common  to  virtually  all  specialties  is  that  for 
EM  service  (originally  numbered  90000  through  90699  in  the  CPT 
coding  system).  As  the  RBRVS  process  evolved,  inadequacies  in  the 
EM  codes  became  apparent. 


22For example, almost all "50-minute hours" in psychiatry share the same HCPCS
code of 90844.
A major study of EM services was the PPRC examination of visits and
consultations. In this study,23 339 physicians in three specialties
(internal medicine, rheumatology, and urology) were surveyed to obtain
estimates of physician time spent with a patient during the day
of the visit for seven categories of EM services (both hospital and
office).24 Physicians were also asked for estimates of the time spent on
work related to the visit but performed before or after the day of the
visit, for five categories of EM services.25 Finally, physicians used
magnitude estimation to assess the total work for each visit and the
proportion of total work performed on the day of the visit. The survey
was conducted using encounter forms for 55 consecutive office visits,
hospital visits, and consultations. This survey produced encounter forms
for a total of 19,143 visits.

PPRC later convened a 46-member consensus panel26 to examine the
following issues: (1) relationship between different measures of time
and total work, (2) differences in physician practice styles and use of
nonphysician providers, (3) specific EM services provided during
visits of a particular duration, (4) impact of specific variables on the
relationship between work and time, and (5) common encounter times
for different classes of visits. Information was collected from
panelists via telephone and mail surveys, as well as face-to-face
meetings. The panel used data from Harvard Phase I, the PPRC Survey of
Visits and Consultations, and an AMA ad hoc committee on visits and
levels of service.

The panel recommended that EM codes recognize three classes of
visit (new patient/initial care; established patient/subsequent care;
and consultation) and five levels of service within each class based on
content and "typical encounter time." PPRC refined the recommendations
of the consensus panel into a visit coding system, consisting of
12 classes of visits and five levels of services within each class.27


23Lasker,  Marquis,  and  Morrow  (1991). 

24The categories were (1) record review, (2) history and physical exam, (3)
counseling, (4) charting and dictation, (5) contact with other providers, (6) patient-specific
contact with house staff, and (7) scheduling activities.

25The categories were (1) review of records, (2) talking to patient and family, (3)
charting and dictation, (4) contact with other providers, and (5) scheduling and
obtaining test results.

26The panel consisted of 29 physicians, 5 representatives from the CPT Editorial
Panel, 8 representatives from Medicare carriers and private insurers, 2 consumer
representatives, 1 nurse, and 1 physician assistant.

27The 12 visit classes include (1) new and established patient office visits, (2) initial
and subsequent hospital visits, (3) initial and follow-up consultations, (4) initial and
subsequent nursing facility visits, (5) initial and subsequent rest home visits, and (6)
new and established patient home visits. The levels of service include the following
Final authority for implementing new CPT codes, from which the
HCPCS  codes  are  derived,  comes  from  the  CPT  Editorial  Panel  of  the 
American  Medical  Association.  The  panel  proposed  new  CPT  codes  in 
November  1990,  and  in  January  1991  began  a  pilot  test  of  these  new 
codes  for  the  following  EM  services:  (1)  office  and  outpatient  visits, 
(2)  inpatient  hospital  visits,  and  (3)  consultations.  These  pilot  codes 
include  categories  for  (1)  office  and  outpatient  visits  for  new  patients, 
(2)  office  and  outpatient  visits  for  established  patients,  (3)  initial 
inpatient  hospital  care,  (4)  subsequent  inpatient  hospital  care,  (5) 
initial  consultations,  and  (6)  follow-up  consultations.  Each  category  is 
also divided into 3-5 levels of care. As a result of these efforts, a new
set of EM codes was established and assigned the numbers 99000
and upward. RVWs for these codes were established by HCFA in a
meeting  of  small  group  panels  (see  discussion  of  the  latest  steps  at 
the  end  of  this  section). 

STEP  5:  EXTRAPOLATE  WORK  VALUES  TO 
NON-SURVEYED  SERVICES 

Because  most  services  were  not  directly  assessed  in  Phases  I  and  II, 
the  last  step  in  creating  the  RBRVS  was  to  develop  estimates  of  total 
work  for  non-surveyed  services. 


Extrapolation Using Charge-Based Ratios Within CPT
"Families"

Extrapolation in the first two phases was accomplished by first
identifying "benchmark" services from the set of surveyed services that
could be identified with a "family" of HCPCS codes. Work values for
codes within a family but not surveyed were estimated by extrapolation
using the ratio of the average allowed charge for the nonsurveyed
service to the allowed charge for the "benchmark" service. In other
words, within a family the ratio differences among allowed charges
were assumed to represent accurately the ratio differences in work.
For EM services, extrapolations were specialty-specific, because the
same CPT codes had RVWs that varied widely across specialties.
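
A minimal sketch of the charge-ratio extrapolation follows. It assumes
only what the paragraph above states: the work value of a nonsurveyed
code equals the benchmark work value multiplied by the ratio of average
allowed charges. The function name and the numbers are hypothetical.

```python
def extrapolate_rvw(benchmark_rvw, benchmark_charge, target_charge):
    """Scale the benchmark work value by the ratio of allowed charges.

    Assumes, as the extrapolation method did, that within a family
    charge ratios mirror work ratios.
    """
    if benchmark_charge <= 0:
        raise ValueError("benchmark charge must be positive")
    return benchmark_rvw * (target_charge / benchmark_charge)


# Hypothetical family: the benchmark service has a work value of 100 and an
# average allowed charge of $500; a nonsurveyed service in the same family
# has an average allowed charge of $750.
estimated_rvw = extrapolate_rvw(100.0, 500.0, 750.0)
print(estimated_rvw)
```

Any distortion in relative charges within a family carries over directly
into the extrapolated work values, which is why the assumption matters.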

In Phase I, this process produced RVWs for about 1,400 services,
accounting for about 67 percent of total Part B allowed charges and
about  80  percent  of  allowed  charges  for  surgical  services.    Phase  II 


measures  of  encounter  time:    (1)  10  minutes,  (2)  20  minutes,  (3)  30  minutes,  (4)  45 
minutes,  and  (5)  60  minutes. 
refined the definition of CPT families and used more recent data for
calculating extrapolation ratios. These steps produced RVWs for
2,024 CPT codes (200 surveyed plus 1,824 extrapolated), accounting
for about 84 percent of Part B allowed charges for surgical services.
When combined with findings from Phase I, RVWs were calculated
for 2,412 CPT codes in 262 families.

In Phase II, extrapolated values were validated by comparing
surveyed values with extrapolated values in families with more than one
surveyed service. This involved 104 surgical services in 39 families.
After excluding extreme values, the average discrepancy between
surveyed and extrapolated RVWs was 16.2 percent. Only about
one-third of the extrapolated values were within 10 percent of the
surveyed values.
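
The kind of validation summary reported above can be sketched as
follows. The source does not give the exact formula, so this sketch
assumes that the discrepancy is the absolute difference expressed as a
percentage of the surveyed value; it does not model the exclusion of
extreme values, and the paired values are invented for illustration.

```python
def percent_discrepancy(surveyed, extrapolated):
    """Absolute difference as a percentage of the surveyed value (assumed)."""
    return abs(extrapolated - surveyed) / surveyed * 100.0


def validation_summary(pairs, tolerance=10.0):
    """Return (mean percent discrepancy, share of pairs within tolerance)."""
    discrepancies = [percent_discrepancy(s, e) for s, e in pairs]
    mean_discrepancy = sum(discrepancies) / len(discrepancies)
    share_within = sum(1 for d in discrepancies if d <= tolerance) / len(discrepancies)
    return mean_discrepancy, share_within


# Hypothetical (surveyed, extrapolated) pairs for three services.
pairs = [(100.0, 105.0), (200.0, 250.0), (50.0, 40.0)]
mean_discrepancy, share_within = validation_summary(pairs)
print(round(mean_discrepancy, 1), round(share_within, 2))
```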

Filling  the  Gaps 

Dissatisfaction with the extrapolation method led to Phase III of the
Harvard study, which "filled the gaps" in the first two phases by
directly estimating work for HCPCS codes that had previously been
calculated by extrapolation.

Expert panels of 15 physicians per specialty were established for 26
specialties. Several different surveys of about 50 services each asked
for estimates of total work for a standard service and other high-volume
services in each family, and intra-service work values for all
remaining services in each family, as well as for new services and
changing technologies or practice patterns. Although the proposal for
conducting Phase III indicated that data would be obtained by some
mix of single-round mail surveys and multiple-round group processes,
the method actually used has not yet been published.

THE  LATEST  STEPS 

The January 1992 deadline for implementation of the MFS did not
allow enough time for assigning RVWs for every HCPCS code through
Phase III. Also, following the Notice of Proposed Rule Making
publishing tentative RVWs,28 HCFA received tens of thousands of
comments recommending revisions to the system. To fill in the final
gaps, rectify obvious errors, and establish values for new HCPCS


28Health Care Financing Administration (1991a).
codes (including the new EM visit codes), HCFA conducted a series of
small-group panels.29

Work  values  for  these  codes  were  obtained  from  panels  of  HCFA 
Carrier  Medical  Directors,  using  a  small-group  discussion  process  and 
employing  estimates  of  work  values  from  the  Harvard  study  as  the 
starting  point  for  the  discussions.  Panelists  first  participated  in  a 
mail  survey,  then  in  a  face-to-face  three-day  meeting  in  groups  of  six 
members.  They  were  provided  with  a  list  of  "reference"  services,  i.e., 
high-volume  services  with  RVUs  that  were  not  under  review,  and 
were  asked  to  provide  estimates  of  total  work  only.  The  small-group 
process  did  not  include  public  voting  or  consensus  on  RVWs;  the  mean 
value  of  individual  ratings  was  taken  as  the  group  rating.  It  is  worth 
noting,  however,  that  the  RVWs  of  the  individual  panel  members 
showed  a  remarkable  convergence  to  an  implied  consensus. 


29These panels were conducted by the HCFA Office of Programs and
Demonstrations, based on recommendations of an earlier draft of this report. The
results reported here come from personal communications with HCFA staff and
observations by the authors.


OBTAINING  MAGNITUDE  ESTIMATES 
FROM  PHYSICIANS 


Phases I and II of the Harvard study surveyed, by telephone, a stratified
random sample of physicians from the AMA 1986 Physician
Masterfile to obtain magnitude estimates of the work required to
perform the services described in vignettes. For Phase II, several other
survey methods were used purely on an experimental basis, including
some that involved differing degrees of respondent interaction. The
Phase II data used for the published RVWs did not incorporate data from
these alternative survey methods. As we noted in Section 2, the Harvard
study has drawn criticism for both the source of the sample of physicians
and the method of obtaining data from them.

In this section, we examine three aspects of the method for obtaining
work values. First, we look at the choice of who should provide
estimates of work values. Second, in the main part of this section, we
examine different methods for surveying physicians to obtain work
values. Finally, we estimate the cost of obtaining data for the major
methods examined.

WHO  SHOULD  ESTIMATE  THE  MAGNITUDE  OF  WORK? 

Recall that the Harvard study randomly sampled physicians from the
AMA Masterfile to obtain survey respondents. This tactic was open
to some criticism. First, physicians can declare a specialty on that file
without being board certified; some of these self-declared specialists
may not have experience with the services being rated. Second, a
specialty taken as a whole may be too broad a sampling frame to
ensure familiarity with all of the services offered within that specialty.
In the study done by Abt Associates for the Society for Thoracic
Surgery,1 that single specialty was divided into three subspecialties,
with each subspecialty given its own set of services to rate. Third, a
random sample of physicians may not be adequate to the task;
instead, peer-recognized experts might be necessary to ensure the
knowledge necessary to provide the ratings.

Although these criticisms merit consideration, no evidence
demonstrates that the validity of the Harvard study was compromised by its
survey sampling selection. The Abt study provides the strongest case


1Noether et al. (1990).
for more select screening of survey respondents; however, that study's
ratings of intra-service work did not differ greatly from those of the
Harvard study. The large differences between Abt and Harvard for
calculations of total work may be due to the different methods of
measurement rather than the populations sampled. Admittedly cursory
analyses of the effects of physician experience in Phase II of the
Harvard study showed no effects.2 In addition, raters were always
free to abstain from responding if they believed themselves
unfamiliar with the service portrayed in any vignette.

Several critics have argued that survey respondents and expert panel
members who rate services largely provided by a particular specialty
should be nominated by that specialty's society, or at least drawn
from the appropriate section of the Directory of the American Board
of Medical Specialties. The argument is that this is the only way to
guarantee that the panel's members will have the direct experience
necessary to rate the services. That arrangement is an attractive
proposition from the point of view of ease of obtaining experts, but it
risks conflict of interest. HCFA has a justifiable fear that specialty
societies, aware of how the RBRVS process works, could "game"
ratings so as to raise their services' work values relative to the values of
other societies. Because the MFS is essentially a constant-sum
allocation scheme among physician specialties, it is susceptible to such
gaming. Anecdotal evidence exists of some gaming attempts during
Phase II of the Harvard study. Thus, although specialty societies
may well provide the best basis for recommendations to
inter-specialty expert panels, as the heterogeneous composition of the panels
makes gaming more difficult, the societies probably provide a poor
source for potential survey respondents for single-specialty panels.3

A  new  database  being  developed  at  HCFA  may  provide  the  answer  to 
the  issue  of  recruiting  future  panel  members  or  survey  respondents. 
Beginning  October  1991,  each  physician  receiving  remuneration  from 
Medicare  has  a  Unique  Physician  Identification  Number  (UPIN). 
Merging  the  UPIN  database  with  the  national  historical  database 
that  contains  all  charges  to  Medicare  will  yield  information  about 
which  physicians  are  billing  which  CPT  codes.  This  new  information 
can  be  used  to  identify  experienced  physicians  who  could  be  tapped  as 


2Hsiao et al. (1990).

3Rumors have surfaced that representatives of specialty societies contacted
Harvard study panelists before meetings to impress on the panelists the consequences
of different types of judgments. Such influence attempts may or may not have
occurred, but even their suggestion points to the inherent conflict of interest in having
specialty societies play a major role in determining relative work values.
potential survey respondents without reference to membership in the
AMA or specialty societies. Suitable geographical, practice setting,
and other desirable panel stratification characteristics can be
obtained through the Medicare fiscal intermediaries.
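
As a rough illustration of how merged billing records might be used to
identify experienced physicians, consider the sketch below. Everything
about the record layout is an assumption made for illustration: the
(identifier, code) pairs, the placeholder identifiers and codes, and the
minimum-volume threshold are hypothetical and do not describe the actual
HCFA databases.

```python
from collections import Counter


def experienced_physicians(claims, code, min_claims=10):
    """Return identifiers of physicians who billed `code` at least `min_claims` times.

    `claims` is an iterable of (physician_id, billed_code) pairs, a
    hypothetical stand-in for the merged UPIN and charge data.
    """
    counts = Counter(pid for pid, billed in claims if billed == code)
    return sorted(pid for pid, n in counts.items() if n >= min_claims)


# Hypothetical claims: physician "U001" bills CODE_A often, "U002" rarely,
# and "U003" bills a different code altogether.
claims = (
    [("U001", "CODE_A")] * 12
    + [("U002", "CODE_A")] * 3
    + [("U003", "CODE_B")] * 20
)
candidates = experienced_physicians(claims, "CODE_A", min_claims=10)
print(candidates)
```

A sampling frame built this way rests on demonstrated billing experience
rather than on self-declared specialty or society membership.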

The question remains of whether panels should be composed
exclusively of procedure performers or of some mix of performers and
nonperformers (given that the latter would be familiar with the procedure
through education, training, observation, and conversation). We
believe that as the RBRVS is revised, some mix of performers and
others is necessary. For procedures that are well understood by the
general medical community, the heterogeneous panel is better for locating
relative work on the common, multi-specialty scale of values.
However, for procedures that require deeper technical understanding,
the performers are uniquely qualified to estimate the RVWs. If these
highly specialized services are estimated in the context of a
well-defined, previously established scale of values for the more general
procedures, the opportunities for gaming are severely constrained.4

GROUP-BASED  METHODS  FOR  OBTAINING  WORK  VALUES 

As part of Phase II, the Harvard study investigated the possibility of
substituting small groups in place of the telephone surveys for
estimating RVWs. This investigation was to assess the validity of the
single-round telephone survey method and to explore less costly and
more efficient alternatives.5 The study found that the more the
panelists could interact, the further their RVWs deviated from the figures
provided in Phase I. The Harvard study concluded that this deviation
was a bias resulting from group processes and that an individual-based
method (telephone or mail survey) was preferable to other
data-collection methods. We agree in general with this test of methods.
Upon examination of the results,6 however, we question whether the
Phase I telephone survey results should be taken as a gold standard
for method comparison. Absent any evidence that such a gold
standard was established, we examine the social psychological literature
on judgment of numerical estimations to see how it might guide
method selection. That literature argues for the use of small groups,
either Delphi or face-to-face, for revising estimates of RVWs. This


4Even for sessions rating highly technical services, one or two nonspecialist
panel members who participate to only a minor extent might assure that the
specialists' ratings remain within the bounds of reason.

5Hsiao et al. (1990), p. 669.

6Hsiao et al. (1990), Chapter 11.


conclusion  contrasts  with  the  Harvard  conclusion  that  a  single-round 
mail  survey,  producing  results  closest  to  the  telephone  survey  "gold 
standard,"  will  suffice. 

In this subsection, we first summarize the comparison of three
methods of obtaining RVWs conducted as part of Phase II of the Harvard
study. Then, we review the recent social psychological literature on
collective-individual versus group-based judgment methods,
concentrating on the results of empirical studies. We conclude with a
reinterpretation of the Phase II small-group process data and recommend
how to collect RVWs in the future.

Individuals  Versus  Small  Groups  in  the  Harvard  Study 

In Phase II of the Harvard study, three methods for generating
physicians' estimates of RVWs were compared with values obtained
(primarily in Phase I) from national telephone interview surveys.
Three panels of general surgeons were selected from a pool of 60
nominees. Each of the six major regional surgical societies submitted 10
nominees to this pool, resulting in a mix of academic and
community-based surgeons. Eleven of these surgeons made up Panel A and
participated in a combined Delphi and face-to-face group. Nineteen
surgeons formed Panel B and took part in a Delphi group with multiple
rounds of ratings interspersed with feedback. The 29 Panel C
participants completed a single-round mail survey.7

Delphi Process. A Delphi method is a quasi-small-group process
where participants receive anonymous and limited feedback on each
other's ratings.8 Typically, participants in a Delphi method are
surveyed in the following steps:

1. Each individual panelist responds to the questions. The
participants' answers are then summarized by the average and a
distribution of responses.

2. The results of the first round are returned to the individual
panelists for an iteration. After reviewing the feedback of first-round
results and the relative location of their individual initial
responses with respect to those of their peers, the panelists may
modify their beliefs.

3. Multiple iterations of the feedback process typically produce
convergence to a consensus.
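
The round-by-round mechanics of the steps above can be illustrated with
a toy simulation. The updating rule is purely an assumption made for
illustration: each panelist is taken to move a fixed fraction of the way
toward the group median after seeing the feedback. Real panelists are
free to revise, or not, as they see fit, and the starting ratings are
invented.

```python
from statistics import mean, median


def summarize(ratings):
    """Feedback returned to panelists: central values and the range."""
    return {
        "mean": mean(ratings),
        "median": median(ratings),
        "low": min(ratings),
        "high": max(ratings),
    }


def next_round(ratings, pull=0.5):
    """Assumed revision rule: move `pull` of the way toward the group median."""
    center = median(ratings)
    return [r + pull * (center - r) for r in ratings]


def run_delphi(initial_ratings, rounds=3, pull=0.5):
    """Return the list of ratings for each round, starting with the initial ones."""
    history = [list(initial_ratings)]
    for _ in range(rounds - 1):
        history.append(next_round(history[-1], pull))
    return history


# Hypothetical first-round ratings from five panelists for one vignette.
history = run_delphi([80, 100, 110, 150, 200], rounds=3)
for number, ratings in enumerate(history, start=1):
    feedback = summarize(ratings)
    print(number, feedback["median"], feedback["high"] - feedback["low"])
```

Under this rule the range of ratings narrows every round while the
median stays put. Convergence among panelists therefore says nothing, by
itself, about whether the consensus value is accurate.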


7One nominee, originally slated for Panel B, dropped out of the study.
8Dalkey, Brown, and Cochran (1969).
The Delphi method has at least three advantages. First, a Delphi method
may yield more accurate judgments than the combined results from a
single-round survey of individuals.9 This occurs because respondents
can, anonymously and therefore without loss of face, revise their
judgments. Second, the multiple-round process may produce more
ideas than conventional face-to-face discussion groups. Delphi
panelists are not subject to any face-to-face social influence processes that
might constrain the development of alternative judgments.10 Third,
Delphi transactions do not require the synchronicity of telephone or
face-to-face meetings and therefore are not as expensive to conduct.

The major disadvantage of a pure Delphi process is that members are
not able to discuss the reasons for their decisions and revisions among
themselves. Tasks that involve a number of facts to be considered or
recalled are well suited for freely interacting groups. Members'
shared knowledge and error-checking discussions serve to increase
the accuracy of these groups over groups with limited
communication11 and over collective-individual results.12

In  the  Harvard  study,  surgeons  in  Panel  B  rated  the  intra-service 
work  of  55  vignettes  in  three  Delphi  rounds  by  mail.  Thirty-eight  of 
these  vignettes  were  taken  from  the  Phase  II  resurvey  of  general 
surgery;  the  remaining  17  were  generated  in  a  similar  manner. 

Combined Delphi and Face-to-Face Process. Panel A combined
features of a Delphi method with a face-to-face discussion leading to
individuals' re-rating of vignettes. Panelists performed two rounds of
Delphi by mail before meeting together. At the meeting, they were
instructed to try to reach a consensus. After one discussion session
and ratings ("Round 3"), the panelists met for a second time ("Round
4") to hammer out a consensus on the few vignettes for which
consensus had not yet been obtained. Panels A and B rated the same 55
vignettes.

Single-Round Mail Survey. The simplest method investigated in
Phase II of the Harvard study was a single-round mail survey; as
such, it was not really a group process (although labeled as such in
the report) but a survey of individuals. Participants completed a
survey, and their individual responses were averaged to produce a central
statistic that was considered representative of the group.

9Ibid.

10For example, Burton (1987); McGrath (1984).

11For example, Laughlin and McGlynn (1986).

12For example, Michaelson et al. (1989); Stephenson, Clark, and Wade (1986);
Vollrath et al. (1989).

The obvious advantages are in the simple logistics and lowered costs.
A single-round mail survey is less cumbersome and has fewer
transactions than telephone interviews, successive mailings, or face-to-face
meetings. If data from such a mail survey do not differ from data
that would be obtained from a more expensive method, the choice of a
mail survey is clearly appropriate.

The surgeons in Panel C did not receive the same booklet as the
participants in Panels A and B; instead, they rated 25 vignettes, only
some of which were from the set of 55 rated by the other panels. They
also rated pre- and post-service work for those vignettes.
Consequently, their results could be compared only to the national survey.

Results of the Harvard Comparison. For both Panels A and B,
feedback about the prior collective-individual ratings tended to
produce, as anticipated, a convergence to a consensus. Outliers tended to
move toward the center on successive rounds, and the differences
between the ratings given by the individual surgeons were dramatically
reduced. However, during the interactive process, the convergence
point diverged over iterations from the telephone survey baseline
average. That is, the absolute "degree of disagreement" between a given
round and the national survey over all surveyed services increased as
the number of rounds increased. Typical disagreement between panel
and national survey ratings is illustrated by the work value
differences for the general surgery specialty given in Table 1. For both
Panels A and B, the later the round, the more distant the panelists'
median judgment from the national survey values.13 Consequently,
the feedback that resulted in the reduction of differences between
individual panelists contributed to an increase in differences between
the panels and the national survey. Furthermore, the combined
method that allowed knowledge-sharing and error-checking
discussions generated the greatest differences from the national survey in
estimates of work values. In contrast, the collective intra-service
work ratings from the Panel C single-round survey compared
favorably with those of the national survey.

Table 1

Percentage Absolute Difference Between
Phase II Group and National Survey
General Surgery RVWs

Panel       Round 1       Round 2       Round 3       Round 4

A             12.6          14.6          21.0          23.8
B             11.0          11.0          15.2

(A vs. B: 12.8)

13Hsiao et al. (1990), Table 11.5, p. 697.
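
A disagreement measure of the kind reported in Table 1 might be computed
along the following lines. The exact formula used in the Harvard report
is not reproduced here, so this sketch assumes that, for each vignette,
the panel's median rating is compared with the national survey value,
and that the absolute percentage differences are then averaged over
vignettes. The vignette labels and ratings are invented.

```python
from statistics import median


def round_disagreement(panel_ratings, national_values):
    """Mean absolute percentage difference between panel medians and national values.

    panel_ratings maps a vignette label to the list of panelists' ratings
    for one round; national_values maps the same labels to survey values.
    """
    differences = []
    for vignette, ratings in panel_ratings.items():
        national = national_values[vignette]
        differences.append(abs(median(ratings) - national) / national * 100.0)
    return sum(differences) / len(differences)


# Hypothetical data: two vignettes rated by three panelists in two rounds.
national = {"v1": 100.0, "v2": 200.0}
round_1 = {"v1": [90.0, 110.0, 120.0], "v2": [180.0, 200.0, 260.0]}
round_2 = {"v1": [115.0, 120.0, 125.0], "v2": [220.0, 230.0, 240.0]}

d1 = round_disagreement(round_1, national)
d2 = round_disagreement(round_2, national)
print(round(d1, 1), round(d2, 1))
```

In this invented case the panelists agree more closely with one another
in the second round, yet their medians lie farther from the national
values, which is the pattern described in the text.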

The  Harvard  study  interpreted  these  differences  as  a  deviation  from 
the  national  survey  "gold  standard"  and  a  reason  to  reject  multi-round 
group  methods.  Our  view  is  that  a  deviation  is  not  necessarily  invali- 
dating; the  national  survey  standards  may  well  be  less  accurate. 
Without a true standard, the comparison alone cannot determine
which set of figures is the more valid. Unfortunately, there are no
other  studies  comparing  individual  vs.  group  decision  processes  that 
employ  physicians  as  subjects  or  address  the  value  of  work.  There- 
fore, to  shed  some  light  on  this  issue,  we  reviewed  the  recent  social 
psychological  literature  to  examine  studies  directly  comparing  indi- 
vidual-based collective  decisions  to  group-based  ones. 

A  Review  of  Individual  Versus  Group  Judgment  Methods 

We reviewed the empirical literature of the past ten years for
comparisons between individual and group judgment methods. Older
studies can generally be found in textbooks and so provide a
framework rather than new information; we therefore focused on recent
studies that explicitly compare the two types of judgment methods.

Intellective Versus Judgmental Tasks. A potentially important
distinction in categorizing decision tasks is the one drawn by
Laughlin and his coworkers between intellective and judgmental
tasks.^14 Generally, intellective tasks and judgmental tasks are
considered to be at opposite ends of an abstract fact-verification
continuum. Intellective tasks are those for which a correct answer can
be demonstrated, whereas judgmental tasks have no demonstrably correct
answer, but instead generate responses based on the beliefs, feelings,
or guesses of the decisionmakers.^15 One can illustrate the continuum
by looking at the types of tasks that have been investigated:

Intellective pole
    Almanac-type questions^16
    Logical rule induction
    Learning (e.g., recall) tasks
    Jury decisions
    Personnel decision tasks
    Moral choice dilemmas
Judgmental pole

If the purpose of comparing individual and group decision processes is
to assess the accuracy of the decision, then the criterion for
intellective tasks is easy to define: the right answer is generally
known, if not necessarily by the group members. For almanac-type
questions, one looks in the almanac; for recall tasks, the list of
items to be recalled is known by the instructor or experimenter.
However, for judgmental tasks, where there may not be a "correct"
answer, the measure of accuracy needs to be defined and may be fairly
indirect. Each experimental laboratory has defined its own measure of
comparison for studies of judgmental tasks, and each must be
considered separately.

The presence or absence of an empirically verifiable correct answer
has led to different predictions about the superiority of individual
versus group decision processes. When a correct answer can be
demonstrated, a "truth wins" type of decision rule may be adopted by
the group, so that if one member can provide that answer, the group
will move to it. Therefore, for intellective tasks, groups should be
superior to individuals. But when no verifiably correct answer is
known, groups might be more vulnerable to the types of
noninformationally based social influence processes hypothesized to
occur in conventional face-to-face discussion.^17 For example, some
panel members may sway the opinion of other members who previously
held better judgments. According to this logic, a survey of
individuals should produce better decisions than a panel dealing with
the same issue.

^14 Laughlin (1980); Laughlin and Futoran (1985); Laughlin and McGlynn
(1986). See also McGrath (1984); Stasser, Kerr, and Davis (1989).

^15 The distinction between intellective and judgmental is itself not
always crisp. For example, although in baseball there are explicitly
written rules about what makes a pitch a ball or a strike (therefore,
especially in these days of instant replay, making the task appear to
be an intellective one), the reality for the pitcher and batter is
that a pitch's status is strictly determined not by its physical
location but by what the umpire says it is (making the task a
judgmental one).

^16 For example, "What is the height of Mount Kilimanjaro?"

^17 For example, Janis (1972); Stasser, Kerr, and Davis (1989).
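The "truth wins" argument can be made concrete with a little
arithmetic. As a minimal sketch, assume (a simplification introduced
here for illustration, not a model taken from the studies cited) that
each member independently knows the demonstrable answer with
probability p and that the group adopts that answer whenever at least
one member supplies it. The function name is hypothetical.

```python
def p_group_correct_truth_wins(p_member: float, group_size: int) -> float:
    """Probability that a group reaches the correct answer under an
    idealized "truth wins" rule.

    ASSUMPTIONS (illustrative only): members are independent, each is
    correct with probability p_member, and one correct member is enough
    to carry the group.
    """
    if not 0.0 <= p_member <= 1.0:
        raise ValueError("p_member must be between 0 and 1")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # The group fails only if every member fails.
    return 1.0 - (1.0 - p_member) ** group_size


if __name__ == "__main__":
    for size in (1, 3, 6):
        print(size, round(p_group_correct_truth_wins(0.3, size), 3))
```

The same arithmetic does not carry over to judgmental tasks: when no
answer can be demonstrated, a member holding the better judgment has
no special means of persuading the others, which is why the prediction
for such tasks is less favorable to groups.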
An important question is whether the estimation of RVWs is more of an
intellective or a judgmental task. Generally, quantitative estimation
tasks such as this have been placed within a typology of intellective
tasks,^18 but the question merits deeper discussion. On the one hand,
because, like jury verdicts, the amount of work to perform a medical
service is what a panel declares it to be, the task may be thought of
as a judgmental one. On the other hand, there exist empirically
observable features of work (such as the time to perform it, the time
needed to learn how to perform the service, and the complexity of the
procedure) that lead to the existence of more correct versus more
incorrect possible work values. In addition, some potential work
values may be considered egregiously incorrect, as when the work to
perform two separate tasks is less than the work to perform just one
of those tasks.^19 These characteristics of the task pull it toward
the intellective pole. Our understanding of the task leads us to
believe that estimating RVWs falls in the middle of the
intellective-judgmental continuum, possibly slightly toward the
intellective side. Others might have differing viewpoints on this
matter; in any event, it seemed important to examine both intellective
and judgmental tasks.
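The "egregiously incorrect" case mentioned above, in which two tasks
together are rated as less work than one of them alone, is the kind of
error that can be checked mechanically. The following is a minimal
sketch of such a check; the function, the service names, and the work
values are all hypothetical and do not come from the RBRVS.

```python
def find_bundle_inconsistencies(work_values, bundles):
    """Return (combined, component) pairs where the combined service is
    rated as less work than one of its own components.

    work_values: dict mapping service name -> work value
    bundles: dict mapping combined service -> list of component services
    """
    problems = []
    for combined, components in bundles.items():
        for component in components:
            if work_values[combined] < work_values[component]:
                problems.append((combined, component))
    return problems


# Hypothetical ratings (illustration only).
work_values = {
    "service_x": 4.0,
    "service_y": 6.5,
    "service_x_plus_y": 5.8,  # rated below service_y alone
}
bundles = {"service_x_plus_y": ["service_x", "service_y"]}

if __name__ == "__main__":
    print(find_bundle_inconsistencies(work_values, bundles))
```

A check of this kind identifies only demonstrable errors; it says
nothing about whether values that pass it are correct, which is why
the task retains a judgmental component.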

Results  of  the  Literature  Search.  Our  literature  search  for  recent 
empirical  tests  of  collective-individual  versus  interacting-group  deci- 
sions yielded  23  journal  articles  published  within  the  last  ten  years. 
Of  these,  seven  were  intellective  tasks  and  16  were  judgmental. 

The seven intellective task experiments included three rule induction
experiments, one recall of a simulated police interrogation, one
recall of a mock trial, one applied problem-solving task in a
contextually relevant work setting, and one competitive
resource-sharing game.^20 In six of these seven studies, the results
showed interactive group processing to be superior to the aggregated
individual outputs. Groups induced a greater number of correct rules
than individuals as a result of better hypothesis evaluation and
error-checking. Groups were better than the average individual at
recalling information from given scenarios. And group outputs were
more accurate than the best or average individual decision when the
tasks involved solving contextually relevant and consequential
problems related to work. Only one of the seven studies yielded
results not favoring group-based decisions over collective-individual
ones, and the results there were equivocal. In that task, where
individuals and groups of two or three persons played mixed-motive
games, individual males and female dyads were able to make more
correct decisions than male dyads, female individuals, or three-person
groups of either gender.

^18 McGrath (1984).

^19 Such instances have arisen in early versions of the RBRVS.

^20 The studies were, respectively, Laughlin and Futoran (1985);
Laughlin and McGlynn (1986); Tindale (1989); Stephenson, Clark, and
Wade (1986); Vollrath et al. (1989); Michaelson, Watson, and Black
(1989); and Irwin et al. (1988).

The  observations  of  group  process  in  these  studies  of  intellective  tasks 
further  supported  the  conclusion  of  the  superiority  of  group-based  de- 
cisions. In  addition  to  generating  greater  accuracy  and  better  hy- 
pothesis evaluation,  group  members  tended  not  to  be  highly  confident 
of  incorrect  answers.  Rather,  group  members  working  on  intellective 
tasks  generally  succeeded  at  error-checking  and  convincing  other 
members  to  select  the  correct  answer,  in  keeping  with  the  "truth 
wins"  type  of  social  decision  process  hypothesized  for  such  tasks. 
Moreover,  the  intellective  task  group  outcomes  were  not  affected  by 
member  ability  or  status  differences  when  those  differences  conflicted 
with  a  correct  answer.  For  example,  in  their  study  of  222  project 
teams,  Michaelson  et  al.  found  that  not  only  did  all  groups  outperform 
their  average  member,  but  215  of  the  groups  outperformed  their  best 
member  with  respect  to  a  comparison  of  group  scores  and  best  indi- 
vidual scores. 

The 16 judgmental tasks include seven studies of mock jury
decisions,^21 five studies of moral-choice dilemmas,^22 one study of
election decisions,^23 and three studies of cognitive bias.^24

Mock jury paradigm. A mock jury study assembles a group to play the
role of a jury and presents them with evidence in some abstract form,
such as a written summary or a videotape. The jurors may be asked for
individual estimates of guilt or innocence (for criminal cases) or
amount and degree of liability (for civil cases), as well as about
confidence in their estimates. The mock jury may then deliberate to a
formal verdict or further individual judgments. The group versus
individual comparison for this paradigm was the most mixed; further
scrutiny showed that the conclusions depended heavily on the
particular means of comparison. The mock jury studies that tested
possible biasing effects of group process on jury deliberations found
little evidence for such effects and concluded that group decisions
did not differ from what would be obtained by averaging individual
decisions. For example, the mock jury studies by Davis and his
coworkers tested the effects of the distribution of individual opinion
and the order and timing of the polling sequence on the final verdict
outcomes of six-person juries. Individuals' personal opinions were
obtained before placing them in groups. Changes in group member
opinion were best characterized as occurring at the group level, not
as a result of any suspected biasing factors such as exaggeration of
individual responses, sequence of individual expression or opinion, or
social pressure of majority influence. The studies showing superior
results for groups examined the number and quality of arguments
presented in the group discussions, which were more numerous and of
higher quality than those posed by individuals. The single study
favoring individual judgments found individuals to be "more fair" than
the groups in that they awarded more similar amounts to male and
female victims for the same case; the authors note that the result may
be in part due to a requirement that the group decision be unanimous.

^21 Bankart and Powers (1986); Davis et al. (1984); Davis et al.
(1989); Hinsz et al. (1988); Kerr and Huang (1986); Ono and Davis
(1988); and Tindale et al. (1990).

^22 Dukerich et al. (1990); Meyers (1989a, 1989b); Nichols and Day
(1982); and Turner, Wetherell, and Hogg (1989).

^23 Stasser and Titus (1987).

^24 Glisson (1987); Stasson and Davis (1989); and Stasson et al.
(1988).

Choice  dilemmas.  A  choice  dilemma  problem  is  one  in  which  two 
conflicting  values  are  posed  as  mutually  exclusive  choices.  For  exam- 
ple, in  a  risk  dilemma,  security  and  modest  worth  may  be  set  against 
risk  and  high  gain  as  a  person  is  asked  to  choose  between  two  differ- 
ent jobs.  For  another  example,  in  a  moral  dilemma,  individual  need 
may  be  set  against  social  rules  as  a  person  has  to  choose  between 
obeying  the  law  and  helping  a  relative  in  distress.  For  such  tasks, 
where  there  is  no  correct  answer  in  any  real  sense  of  the  term,  inves- 
tigators have  looked  at  the  process  by  which  decisions  are  made.  A 
"good"  process  is  one  in  which  particular  individuals  do  not  influence 
others  by  virtue  of  their  status  or  dominance  of  the  conversation,  but 
where  influence  is  through  the  variety  and  quality  of  the  discussion. 
For  example,  Meyers  examined  the  relative  contributions  of  individu- 
als' previous  opinions  and  the  number  and  nature  of  discussion  argu- 
ments in  predicting  the  same  individuals'  later  opinions  and  found 
that  the  variety  of  arguments  explained  the  group  decision  more  than 
either  individuals'  previous  opinions  or  the  number  of  times  an  argu- 
ment was  expressed. 

Election  decisions.  The  study  of  election  decisions  considered  how 
well  information  was  shared  in  a  group  considering  candidates  for  an 
election.  When  most  of  the  information  about  the  candidates  was 
dispersed  (privately  known)  across  the  group,  more  information  was 
publicly  shared  than  when  most  of  the  information  was  shared  before 
the  group  meeting.  Moreover,  individuals  recalled  more  information 
supportive of the group decision than contrary to the group decision
in a post-decision recall task. The authors concluded that
face-to-face discussions were a poor way to share information.

Cognitive bias. The studies of cognitive bias examined the relative
susceptibility of groups and individuals to factors that bias
cognitive judgments. These factors include commitment to the task at
hand as well as the various cognitive biases explored over the past 20
years in the cognitive psychology literature.^25 The Glisson study
found that the commitment to task of workgroups could not be
characterized by the aggregated level of commitment of their
individual members but instead was a function of the variety of skills
represented by the members of the group. The study by Stasson and
coworkers favoring groups showed that groups produced more and better
arguments, which influenced cognitive judgments. The study showing no
advantage for either individuals or groups found that both were
equally susceptible to cognitive biases such as availability,
representativeness, and anchoring.

Implications. Table 2 summarizes the results of the literature
review. For intellective tasks, and to a lesser extent for judgmental
tasks, group-based decision processes were favored over
collective-individual processes. The finding for intellective tasks
was as anticipated, but the similar finding for judgmental tasks is
mildly surprising. It appears, as one study put it, that instead of
groups being subject to feared undue influences, "a heretofore
unsuspected robustness against certain procedural influences during
group consensus achievement is heartening."^26

Although  none  of  the  studies  surveyed  is  a  close  match  to  the  task  of 
estimating  RVWs,  the  conclusion  suggests  that,  in  the  absence  of 
more  directly  focused  evidence,  group-based  methods  (either  Delphi 
or  mixed)  that  permit  the  raters  to  interact  might  be  a  better  choice 
for  revisions  to  the  RBRVS  than  the  mail  survey  of  Panel  C  or  the  na- 
tional telephone  survey  used  for  the  bulk  of  Phases  I  and  II.  To  the 
extent  that  producing  RVWs  is  regarded  as  an  intellective  task,  this 
could  be  considered  a  strong  recommendation.  If  producing  RVWs  is 
considered more of a judgmental task, the recommendation is weaker,
because of the lack of an empirically defined "correct" answer, but
still in the direction of groups. Note that this recommendation is for
group interaction before individual ratings, not for group consensus
ratings. The mixed evidence for the mock jury studies, where group
judgments replaced individual ones, suggests caution before attempting
to insist that the interacting groups reach a consensus.

^25 For example, Kahneman, Slovic, and Tversky (1982).

^26 Davis et al. (1989), p. 1011.

Table 2
Collective-Individual Versus Group-Based Methods

Study                                   Decision Type             Favored

Intellective Tasks
Laughlin and Futoran (1985)             Rule induction            Group
Laughlin and McGlynn (1986)             Rule induction            Group
Tindale (1989)                          Rule induction            Group
Michaelson, Watson, and Black (1989)    Recall; problem solving   Group
Stephenson, Clark, and Wade (1986)      Recall; interrogation     Group
Vollrath et al. (1989)                  Recall; mock jury         Group
Irwin et al. (1988)                     Shared resources          Neither

Judgmental Tasks
Bankart and Powers (1986)               Mock jury                 Individual
Davis et al. (1984)                     Mock jury                 Neither
Davis et al. (1989)                     Mock jury                 Group
Hinsz et al. (1988)                     Mock jury                 Neither
Kerr and Huang (1986)                   Mock jury                 Group
Ono and Davis (1988)                    Mock jury                 Group
Tindale et al. (1990)                   Mock jury                 Group
Dukerich et al. (1990)                  Moral choice              Group
Meyers (1989a)                          Moral choice              Group
Meyers (1989b)                          Moral choice              Group
Nichols and Day (1982)                  Moral choice              Group
Turner, Wetherell, and Hogg (1989)      Moral choice              Group
Stasser and Titus (1987)                Election decisions        Individual
Glisson (1987)                          Cognitive bias            Group
Stasson and Davis (1989)                Cognitive bias            Group
Stasson et al. (1988)                   Cognitive bias            Neither
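The counts behind the summary of Table 2 can be recomputed directly
from its rows. The sketch below simply transcribes the "Favored"
column of Table 2 and tallies it by task type; the variable and
function names are hypothetical.

```python
from collections import Counter

# "Favored" column of Table 2, transcribed row by row.
TABLE_2 = {
    "intellective": [
        ("Laughlin and Futoran (1985)", "Group"),
        ("Laughlin and McGlynn (1986)", "Group"),
        ("Tindale (1989)", "Group"),
        ("Michaelson, Watson, and Black (1989)", "Group"),
        ("Stephenson, Clark, and Wade (1986)", "Group"),
        ("Vollrath et al. (1989)", "Group"),
        ("Irwin et al. (1988)", "Neither"),
    ],
    "judgmental": [
        ("Bankart and Powers (1986)", "Individual"),
        ("Davis et al. (1984)", "Neither"),
        ("Davis et al. (1989)", "Group"),
        ("Hinsz et al. (1988)", "Neither"),
        ("Kerr and Huang (1986)", "Group"),
        ("Ono and Davis (1988)", "Group"),
        ("Tindale et al. (1990)", "Group"),
        ("Dukerich et al. (1990)", "Group"),
        ("Meyers (1989a)", "Group"),
        ("Meyers (1989b)", "Group"),
        ("Nichols and Day (1982)", "Group"),
        ("Turner, Wetherell, and Hogg (1989)", "Group"),
        ("Stasser and Titus (1987)", "Individual"),
        ("Glisson (1987)", "Group"),
        ("Stasson and Davis (1989)", "Group"),
        ("Stasson et al. (1988)", "Neither"),
    ],
}


def tally(task_type):
    """Count how many studies of one task type favored each method."""
    return Counter(favored for _, favored in TABLE_2[task_type])


if __name__ == "__main__":
    for task_type in TABLE_2:
        print(task_type, dict(tally(task_type)))
```

The tally gives six of seven intellective studies and 11 of 16
judgmental studies favoring groups, which matches the statement that
the preference for groups holds "to a lesser extent" for judgmental
tasks.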

Whether  face-to-face  or  Delphi  methods  are  employed  for  revising  the 
RBRVS  probably  does  not  matter  too  much  in  terms  of  affecting  the 
values  obtained,  given  that  these  two  types  of  method  produced  simi- 
lar degrees  of  deviation  from  the  individual-based  methods  in  the 
Phase II experiment. Other factors, such as the ease of obtaining the
sample of physicians and the cost of obtaining data, should be the
primary considerations.



THE  COSTS  OF  DIFFERENT  METHODS  OF  DATA 
COLLECTION 

Future  revisions  of  the  RBRVS  will  necessitate  obtaining  RVWs  from 
physicians.  To  better  decide  which  method  of  obtaining  RVW  data  to 
choose,  we  developed  cost  estimates  for  collecting  such  data  from 
physicians,  using  four  data-collection  methods: 

•  Method 1: interviewer-administered, one-round telephone survey.

•  Method 2: one-round mail survey (self-administered questionnaire).

•  Method 3: two-round mail survey (self-administered questionnaires).

•  Method 4: one-round mail survey (self-administered questionnaire)
   with panel follow-up.

We assume that the data collection would be to revise and update the
RBRVS and not to restructure the whole scale. Therefore, we assume
that 600 RVWs would have to be obtained. We further assume that the
panels would ask only for the total work for a service and that the
RVWs would be provided on the existing common scale of measurement,
thereby eliminating the need for links. For each method, we determined
a sample size that would produce approximately equivalent
between-physician standard deviations of work values.^27 Following a
general description of the data-collection methods, we present
estimated costs for each method and outline the assumptions for these
estimates.

Data-Collection  Methods 

Method 1: Telephone Survey. This data-collection method follows the
"gold standard" used by the Harvard study for Phase I and Phase II of
the national surveys. It features a series of one-round telephone
surveys administered by trained interviewers, with each survey
yielding 100 completed protocols. Before the actual interviews,
respondents receive a survey packet made up of a copy of the survey
and an endorsement letter. To enhance response rates, respondents
receive in advance a check for $20 for their participation in an
estimated 40-minute telephone interview to rate 50 services.^28
Interviewers administer the survey over the telephone, recording
responses on a paper-pencil instrument, which is edited for data
entry.^29 Personal computer software and optical scanning equipment
are used to log completed surveys and track survey progress.

^27 Ideally, we would like to have an estimate of between-group
standard deviations. But all methods developed so far have had only
one group per specialty.

Method  2:  One-Round  Mail  Survey.  This  data-collection  method 
involves  mailing  respondents  self-administered  questionnaires,  to 
cover  the  same  services  as  asked  for  in  Method  1.  The  goal  is  to  have 
100  completed  forms  for  each  of  12  surveys.  To  encourage  an  opti- 
mum response  rate,  respondents  receive  an  initial  telephone  call,  fol- 
lowed in  the  mail  by  an  initial  survey  packet  made  up  of  the  survey,  a 
personalized  cover  letter  and  endorsements,  and  a  check  for  $20. 
About  a  week  after  the  initial  mailing,  respondents  receive  a  short 
reminder.  Four  weeks  after  the  initial  mailing,  nonrespondents  re- 
ceive a  replacement  survey  packet  and  follow-up  telephone  calls. 
Personal  computer  software  and  optical  scanning  equipment  are  used 
to  log  returns  and  track  survey  progress.  Completed  surveys  are 
validated  and  edited  for  data  entry. 

Method  3:  Two-Round  Mail  Survey.  This  methodology  is  an 
adaptation  of  the  model  used  in  Harvard's  Phase  III  survey,  which 
involved  a  two-round  mail  survey  of  small  panels  in  each  specialty, 
with  telephone  follow-up.  Because  of  the  anticipated  reduction  in 
variance  of  response  in  the  second  round,  a  sample  size  of  50  com- 
pleted (both  rounds)  protocols  per  survey  instead  of  100  is  required. 
In the first round of the survey, Method 3 is identical to Method 2.
Following  initial  analysis  of  the  Round  1  survey,  a  second  survey  is 
distributed.  Respondents  to  the  second  round  receive  an  additional 
incentive  payment  of  $20.  Follow-up  to  the  initial  mailing  of  the  sec- 
ond-round survey  is  identical  to  that  described  for  the  single-round 
survey  of  Method  2. 

Method 4: One-Round Mail Survey with Panel Follow-Up. Method 4
involves the convening of three separate panels of 13 physicians drawn
from the specialties of interest. Following the selection and
enrollment of panel members, each panel member rates 200 different
services. After initial analysis of the data from the mail survey, the
panels convene for a one-day meeting to rerate all 200 services and
reconcile discrepancies from the first round. To facilitate this
process, panelists receive the results of this first round before the
panel meeting. Panelists receive an honorarium of $500 for their
participation.

^28 The Harvard study interviews lasted no more than 40 minutes;
asking any more time of a physician will drastically increase the
refusal rate.

^29 As an alternative, the instrument could also be set up for
computer-assisted interviewing. However, the cost of programming 12
different instruments for a relatively small sample argues for the
paper-pencil option.

Common  Features 

Several  features  are  common  to  these  data-collection  methods: 

1.  Sample.  Samples  for  Methods  1,  2,  and  3  are  drawn  from  the 
AMA  Masterfile  from  specialty  groups  or  from  the  UPIN/historical 
files. Panel members for Method 4 are selected from a similar file on
an as-yet-undetermined basis. The file is assumed to have current
addresses  and  telephone  numbers  and  requires  tracking  and  updating 
of  addresses  or  telephones  at  a  rate  of  no  more  than  5  percent  of  the 
sample. 

Though  sample  sizes  differ  for  each  method,  according  to  the  pre- 
dicted completion  rate  or  method  employed,  each  method  provides 
ratings  of  600  physician  services. 

2.  Completion  Rate  and  Initial  Sample  Size.  Both  the  size  of  the 
initial  samples  and  the  expected  completion  rates  vary  by  method,  as 
indicated  in  Table  3. 

Though the targeted completion rates are somewhat high for surveys of
physicians, these rates are achievable given the anticipated high
level of motivation and the range of respondent incentives, as
described below. By comparison, the Harvard study reported 62 and 72
percent completion rates for the Phase I and Phase II telephone
surveys, which did not involve monetary incentives for
respondents.^30 Moreover, others have reported rates as high as 78
percent for mailed surveys of physicians that featured advance payment
of $20.^31

Table 3
Completion Rates and Sample Sizes

                                       Completed Protocols    Completed Items
Collection      Completion   Initial
Method          Rate         Sample    Total    Per Survey    Total    Per Survey

Telephone       0.69         1740      1200     100           600      50
1-Round Mail    0.74         1620      1200     100           600      50
2-Round Mail
  Round 1       0.74         900       667      58            600      50
  Round 2       0.90         667       600      50            600      50
Panel           n/a          n/a       39       13            600      200
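The initial sample sizes in Table 3 are consistent with dividing the
target number of completed protocols by the expected completion rate.
The sketch below reproduces that arithmetic. The function name is
hypothetical, and rounding to the nearest ten is an assumption inferred
from the table, not something the report states.

```python
def required_initial_sample(target_completes, completion_rate):
    """Initial sample needed to reach a target number of completes."""
    if not 0.0 < completion_rate <= 1.0:
        raise ValueError("completion_rate must be in (0, 1]")
    return target_completes / completion_rate


# (method, target completes, completion rate, initial sample in Table 3)
TABLE_3_ROWS = [
    ("Telephone", 1200, 0.69, 1740),
    ("1-Round Mail", 1200, 0.74, 1620),
    ("2-Round Mail, Round 1", 667, 0.74, 900),
    ("2-Round Mail, Round 2", 600, 0.90, 667),
]

if __name__ == "__main__":
    for method, completes, rate, reported in TABLE_3_ROWS:
        raw = required_initial_sample(completes, rate)
        print(f"{method}: computed {raw:.1f}, Table 3 reports {reported}")
```

For the second round of Method 3, the initial sample is simply the 667
respondents who completed the first round, of whom 90 percent are
expected to complete the second.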

3.  Survey  Instruments.  Methods  1,  2,  and  3  involve  the  adminis- 
tration of  12  separate  surveys,  each  of  which  provides  RVWs  for  each 
of  50  services.  Thus,  data  for  600  separate  services  are  obtained.  In 
addition  to  these  50  items,  approximately  12  additional  items  on 
physician  and  practice  characteristics  are  collected,  for  a  total  of  62 
items  per  survey.  At  an  estimated  rate  of  two  items  per  minute  plus 
general  introduction,  the  estimated  time  required  to  complete  each 
survey  is  40  minutes,  a  figure  comparable  to  Harvard  study  interview 
times.  For  Method  4,  the  600  services  are  measured  by  having  three 
panels  each  rate  200  services. 
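The item counts and the 40-minute estimate can be checked with simple
arithmetic. In the sketch below, the nine-minute introduction is an
assumption: it is the value implied by the stated total of 40 minutes,
since the text gives only "two items per minute plus general
introduction." All names are hypothetical.

```python
def estimated_minutes(n_items, items_per_minute, intro_minutes):
    """Interview length: time to answer the items plus the introduction."""
    return n_items / items_per_minute + intro_minutes


RVW_ITEMS = 50        # services rated per survey
BACKGROUND_ITEMS = 12  # physician and practice characteristics
ITEMS_PER_MINUTE = 2
INTRO_MINUTES = 9      # ASSUMPTION: implied by the 40-minute total

total_items = RVW_ITEMS + BACKGROUND_ITEMS
survey_minutes = estimated_minutes(total_items, ITEMS_PER_MINUTE, INTRO_MINUTES)

# Both designs cover the same number of services.
services_by_survey = 12 * RVW_ITEMS   # Methods 1, 2, and 3
services_by_panel = 3 * 200           # Method 4

if __name__ == "__main__":
    print(total_items, survey_minutes, services_by_survey, services_by_panel)
```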

For  budgeting  purposes,  we  assume  that  the  instrument  development 
task  involves  selecting  measures  already  developed  in  Phases  I  or  II 
rather  than  composing  totally  new  measures.  For  all  methods,  a 
small  pretest  (n  =  25)  has  been  included  in  the  cost  estimates. 

4.  Respondent  Incentives.  To  ensure  high  completion  rates,  a  va- 
riety of  incentives  are  offered  to  respondents.  All  methods  provide 
payment  for  participation.  Though  the  $20  payment  for  Methods  1,  2, 
and  3  does  not  adequately  compensate  physicians  for  their  time,  even 
such  a  modest  payment  has  been  shown  to  have  a  significant  effect  on 
response rates in surveys of physicians.^32 These three methods also
feature personalized letters of endorsement from medical associations
such as the AMA and specialty societies. The two mail survey
methods  also  include  preliminary  calls,  and  (for  nonrespondents)  ex- 
tensive mail  and  telephone  follow-up.  Method  4  includes  many  of 
these  features,  along  with  a  lump  sum  honorarium  of  $500  to  each 
panel  member. 

^30 Hsiao et al. (1988a); Hsiao et al. (1990).

^31 See Berry and Kanouse (1987).

^32 Berry and Kanouse (1987).

Cost Estimates

The estimated costs shown in Table 4 are based on the following
general assumptions:

•  All estimates are in current (1991) rates and dollars.

•  Though overhead is not included, labor costs do include merit and
   fringe benefits.

•  The estimates are only for the data-collection and entry activities
   (e.g., questionnaire development; hiring, training, and supervising
   field staff; postage; telephone charges; respondent payments). They
   do not include the cost of analysis, report writing, or overall
   project management at the principal investigator level. We assume
   that the panels in Method 4 are conducted by the principal
   investigators.

•  Travel costs for the panel assume that meetings are in Chicago and
   that panelists are from different parts of the country. For the
   one-day meetings, we assume two days/nights of per diem. Travel
   costs for two investigators are included in the estimates.

Table 4

Summary of Estimated Cost of Data Collection
(in 1991 dollars)

Collection      Total       No. of      Cost per     Cost per Rated
Method          Cost^a      Completes   Complete^b   Service^c

Telephone       $105,000    1200        $87.50       $175.00
1-Round Mail    $65,500     1200        $54.58       $109.17
2-Round Mail    $80,000     1267^d      $63.14       $133.33
Panel           $88,000     n/a         n/a          $146.67

^a Total cost of data collection includes all field activities (e.g.,
interviewing, survey distribution, data reduction), supervision,
management, and instrument/materials development.

^b Cost per complete is derived by dividing the total cost of data
collection by the number of completed cases. (This calculation is not
applicable to the panel-rating methodology.)

^c Cost per service is derived by dividing the total cost of data
collection by the 600 rated services.

^d 667 completes for the first round and 600 completes for the second
round.
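The two derived columns of Table 4 follow from the definitions in
notes b and c. The sketch below recomputes them from the total costs
and numbers of completes reported in the table; the function and
variable names are hypothetical.

```python
RATED_SERVICES = 600  # services rated under every method


def unit_costs(total_cost, completes):
    """Return (cost per complete, cost per rated service).

    Cost per complete is None when the number of completes does not
    apply, as for the panel method.
    """
    per_complete = None if completes is None else total_cost / completes
    per_service = total_cost / RATED_SERVICES
    return per_complete, per_service


# (total cost in 1991 dollars, number of completes) from Table 4.
TABLE_4 = {
    "Telephone": (105_000, 1200),
    "1-Round Mail": (65_500, 1200),
    "2-Round Mail": (80_000, 667 + 600),  # both rounds, per note d
    "Panel": (88_000, None),
}

if __name__ == "__main__":
    for method, (cost, completes) in TABLE_4.items():
        per_complete, per_service = unit_costs(cost, completes)
        shown = "n/a" if per_complete is None else f"${per_complete:.2f}"
        print(f"{method}: {shown} per complete, ${per_service:.2f} per service")
```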


CONCLUSION 

The results of our examination of how to obtain RVW estimates for
revising the RBRVS indicate that methods that permit the raters to
interact with each other while rating services are probably preferable
to methods that do not. The differences between group-based methods
and individually based methods shown in the Harvard study do not
automatically lead to the conclusion that group-based methods are
flawed, only that the two types of method lead to nonequivalent
results. Our survey of the social psychological literature suggests
that the group-based methods produce values more indicative of the
respondents' true judgments than do the individual-based methods.

The cost estimates of the various data-collection methods show that
the panel (Method 4) costs only 10 percent more than the 2-round
mail survey (Method 3), a difference we regard as small enough that
the relative costs of the two group-based methods need not be a major
consideration in choosing between them. We recommend the panel
method over the Delphi because it provides more information for the
participants.

If an individually based method continues to be the instrument of
choice, the Harvard study results show that mail surveys and telephone
surveys produce equivalent results. Our cost analysis shows that a
single mail survey is about five-eighths of the cost of a telephone
survey, a savings that clearly makes it the preferred method.


4.  LINKAGE 


This section examines in some detail the linkage procedure used by
the Harvard study group. We employed our understanding of linkage
to attempt to replicate the Phase II results. On the basis of
considerations that arose during the replication effort, we
constructed an alternative linkage procedure using a "perturbation
minimization" concept of linking the diverse specialty surveys to a
common scale. We designed this alternative not as a definitive
replacement for the Harvard linkage technique but rather as a
different method based on slightly dissimilar but equally justifiable
assumptions. The perturbation minimization technique involves both a
reconsideration of the definition of a link and a modified
optimization procedure that can adjust values within each specialty
survey. We close with a discussion of the implications of the
exploration of this alternative approach.

HARVARD  LINKAGE  PROCEDURE 

The services to be linked across specialties were chosen in a series of
steps:¹

1. Technical consulting groups and the project team developed lists of
potential links from the services included in all of the surveys.

2. The cross-specialty panel identified same services (the process, the
time, and the type of patient were essentially the same) and
equivalent services (intra-service work essentially the same and in
the same service category) from the potential links.

3. Potential links that differed by more than 25 percent in terms of
intra-service time were discarded.

4. Services were classified into service and setting categories and
additional potential links were identified from these clusters. The
cross-specialty panel chose further links from this list.

This  method  yielded  275  links  of  paired  services. 

¹Braun et al. (1988b).

The goal of the linkage procedure is to align the individual specialty
scales of work to a common scale so that services from different
specialties with the same values on the common scale have the same
level of work. Three key assumptions were made in this alignment:

1.  The  sampling  of  services  in  the  various  surveys  is  representative  of 
all  the  services  provided  by  a  specialty,  so  that  the  mean  ratings  of 
work  from  a  specialty  survey  approximate  the  mean  rating  for 
work  for  all  services  performed  by  that  specialty. 

2.  When  the  services  in  different  specialties  are  judged  to  be  the  same 
or  equivalent,  they  involve  nearly  the  same  amounts  of  work. 

3.  Each  specialty's  scale  as  a  whole  is  unchanged  after  alignment  so 
that  the  ratios  of  all  the  services  within  the  specialty  remain  the 
same. 

The first assumption translates into the use of physician-level mean
ratings. The second assumption says that the chosen links are
reasonable. The third assumption means that the scales will be aligned
by shifting them relative to each other. The individual scales are not
rescaled internally first, nor are individual services within a
specialty shifted by different amounts. A specialty scale stays rigid
as it is shifted relative to the other scales.

Let $d_{ij}^*$ be the adjusted average of the physician-level
logarithms of the work ratings, where the average is taken over the
approximately 100 physicians surveyed for a specific service $i$ in
specialty $j$. This average was calculated in the Harvard study using
the expectation-maximization procedure,² as discussed in Section 2.

The adjusted $d_{ij}^*$ are centered within specialties so that they
have a mean of zero:

    $d_{ij} = d_{ij}^* - d_{\cdot j}^*$ ,    (1)

where $d_{\cdot j}^*$ is the mean of the $d_{ij}^*$ over the services
surveyed in specialty $j$.
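
As a small illustration of the centering in Equation (1), the sketch below subtracts each specialty's mean log rating from its services. The data structure (a dictionary keyed by service and specialty), the function name, and the sample values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def center_within_specialty(log_ratings):
    """Map (service, specialty) -> d*_ij to (service, specialty) -> d_ij.

    Each value has its own specialty's mean subtracted, so the centered
    values average to zero within every specialty.
    """
    by_specialty = defaultdict(list)
    for (_service, specialty), value in log_ratings.items():
        by_specialty[specialty].append(value)
    means = {s: sum(v) / len(v) for s, v in by_specialty.items()}
    return {key: value - means[key[1]] for key, value in log_ratings.items()}


# Hypothetical log10 work ratings for two specialties.
example = {
    ("s1", "radiology"): 1.0,
    ("s2", "radiology"): 2.0,
    ("s3", "surgery"): 3.0,
    ("s4", "surgery"): 5.0,
}
centered = center_within_specialty(example)
```

Because only a constant is subtracted within each specialty, the differences between services in the same specialty are unchanged by the centering.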

These $d_{ij}$ are on the logarithmic scale and are the differences
from the average service within a specialty. All the
specialty-specific scales are to be translated to a common scale, and
the relationships between the services within a specialty,
$d_{\cdot j}$, should remain the same. The linkage procedure
accomplishes this outcome by shifting all services within a specialty
by a fixed distance $b_j$. In other words, $b_j$ is the position the
specialty-specific origin is shifted to on the common scale. The
location of any service on the common scale is

    $d_{ij} + b_j$ .

²Dempster, Laird, and Rubin (1977).

On the nonlogarithmic scale, $10^{b_j}$ is the multiplicative factor
that changes the specialty-specific work units into common-scale work
units. If the specialty has smaller units than average, then
$b_j < 0$; if the specialty has larger units than average, then
$b_j > 0$. As an analogy, if the common scale is feet and radiologists
measure in inches and surgeons in yards, the radiology constant is
$10^{b_j} = 1/12$ and the surgical constant is $10^{b_j} = 3$.
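
The feet, inches, and yards analogy can be checked numerically. In the sketch below the function name `to_common_scale` and the example lengths are illustrative assumptions; only the two constants, 1/12 and 3, come from the text.

```python
import math


def to_common_scale(d_ij, b_j):
    """Shift a specialty-scale log value onto the common log scale."""
    return d_ij + b_j


# Shift parameters implied by the analogy (common scale = feet).
b_radiology = math.log10(1 / 12)  # inches are smaller units, so b_j < 0
b_surgery = math.log10(3)         # yards are larger units, so b_j > 0

# A length of 24 "radiology units" (inches) and 2/3 "surgery units"
# (yards) should both land at 2 feet on the common scale.
radiology_feet = 10 ** to_common_scale(math.log10(24), b_radiology)
surgery_feet = 10 ** to_common_scale(math.log10(2 / 3), b_surgery)
```

Adding $b_j$ on the logarithmic scale is the same operation as multiplying by $10^{b_j}$ on the original scale, which is why a single shift per specialty leaves all within-specialty ratios intact.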

Linked services should be as close as possible on the common scale.
Define the optimal location of a linked service $i$ on the common
scale to be $a_i$. Using the specialty-specific shift described above,
the deviation of any linked service on the common scale from the
optimal location on the common scale is

    $d_{ij} + b_j - a_i$ .

The optimal parameters $a_i$ and $b_j$ are those that minimize the set
of deviations over all linked services. The $a_i$ and $b_j$ are
estimated via weighted least squares. The reason for taking logarithms
of the ratings is that the error distribution in the linear model is
closer to normal and thus the usual distribution theory and associated
inference tests may be used when the regression model is examined.

The observations are weighted inversely to their estimated variances
$s_{ij}^2$. That is, deviations with small variances will have more
effect on the fitting as their values are better known.

Generally, one does not use weights in least-squares fitting because a
unique estimate of variance at each observation is not known.
Usually, the variance is assumed to be the same for all observations.
However, because the deviations are actually the averages taken over
the approximately 100 physicians that were sampled, the variance for
each specific service may be estimated by the standard error of the
mean.³

In addition to weighting by the variances, the Harvard study used the
iterative Tukey biweight procedure.⁴ The first step in this procedure
is a weighted least squares using the inverses of variance estimates
as weights. The residuals are used to reweight the observations and
the new weighted least-squares problem is solved. This procedure is
iterated until the biweights converge.⁵ The idea behind this procedure
is to give observations with large residuals less weight, as they are
assumed to be outliers that have been poorly surveyed.

³The estimated variance for standard services $s_{ij}^2$ was
calculated separately, as described in Section 2.

⁴Mosteller and Tukey (1977).

Although this procedure is statistically correct, it has an
unanticipated effect if the biweight becomes zero: The link defined by
the panel is in essence overruled and does not enter into the
estimation of $b_j$. These "statistically eliminated" links should be
scrutinized before accepting the results of the analysis, but such
scrutiny appears not to have been done.

The optimization problem is:

    $\min_{a_i, b_j} \sum w_{ij} (d_{ij} + b_j - a_i)^2 / s_{ij}^2$    (2)

where the summation is taken over all specialties $j$ and, for each
specialty, over all linked services $i$. The minimization is done
under the constraints that linked services have the same $a_i$ and the
mean of the $b_j$ values is an arbitrary constant, taken in Phase II
to be 2.025 to compare Phase I and Phase II results. The Tukey
biweights are $w_{ij}$.
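
A minimal sketch of how the problem in Equation (2) could be solved is given below, using NumPy. It is not the Harvard implementation: the function name `fit_linkage`, the biweight tuning constant `c = 6`, and the use of the median absolute standardized residual as the robust scale are assumptions made for illustration, since the text does not specify them. The mean of the $b_j$ is fixed at 2.025 and three biweight steps follow the initial weighted fit.

```python
import numpy as np


def fit_linkage(d, s2, link, spec, mean_b=2.025, n_biweight_steps=3, c=6.0):
    """Estimate link locations a_i and specialty shifts b_j.

    d, s2      : centered log ratings d_ij and their variances s_ij^2
    link, spec : integer index of link i and specialty j per observation
    c          : biweight tuning constant (assumed, not from the report)
    """
    d, s2 = np.asarray(d, float), np.asarray(s2, float)
    link, spec = np.asarray(link), np.asarray(spec)
    n_link, n_spec = link.max() + 1, spec.max() + 1
    rows = np.arange(d.size)

    # Each row encodes a_i - b_j ~ d_ij, i.e. residual d_ij + b_j - a_i.
    X = np.zeros((d.size, n_link + n_spec))
    X[rows, link] = 1.0
    X[rows, n_link + spec] = -1.0

    w = np.ones(d.size)  # biweights start at 1: plain inverse-variance WLS
    for step in range(n_biweight_steps + 1):
        root = np.sqrt(w / s2)
        theta, *_ = np.linalg.lstsq(X * root[:, None], d * root, rcond=None)
        a, b = theta[:n_link], theta[n_link:]
        shift = mean_b - b.mean()  # impose the constraint on the mean of b_j
        a, b = a + shift, b + shift
        if step == n_biweight_steps:
            break
        r = (d + b[spec] - a[link]) / np.sqrt(s2)  # standardized residuals
        scale = np.median(np.abs(r))
        if scale < 1e-12:  # essentially perfect fit: nothing to down-weight
            break
        u = r / (c * scale)
        w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
    return a, b, w
```

Each link contributes one row per linked service, so a link between two specialties yields two observations. Any observation whose returned weight is zero has been dropped from the fit, which is exactly the "statistically eliminated" case that deserves a second look.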

The 275 links resulted in 550 observations in the regression, as every
link constitutes two observations, one for each link direction. The
number of link location parameters $a_i$ is 275, one for every link.
The number of specialty shift parameters $b_j$ is 39, one for each
specialty surveyed in each phase of the Harvard study.⁶

After the fitting has been done, the logarithm of the work of a
specific service is estimated as $d_{ij} + b_j$, where the $d_{ij}$ is
observed and the $b_j$ is estimated. The $a_i$ does not appear in this
equation. The linked services have just pulled each specialty's scale
into alignment and the effect is seen in the locations of all of
specialty $j$'s surveyed services. Given the above definitions, the
position of the particular service on the common scale is related to
the position of the particular service on the specialty-specific scale
by a shift of $b_j$.


⁵In the Phase II Harvard study and our own calculations below, three
Tukey biweight steps were taken.

⁶Although the number of distinct surveyed specialties was 32, seven
specialties were surveyed during both phases. Two specialty shift
parameters, one for each survey, were calculated for these
specialties. For ease of discussion, we will throughout the rest of
this section call the 39 survey sessions "specialties."

DUPLICATING THE HARVARD LINKAGE PROCEDURE

To fully understand the Harvard linkage procedure, we attempted to
replicate it. Because only the means and standard deviations of
services over physicians were available to us, we could not use, much
less validate, the expectation-maximization averaging of the
physician-level data. That is, our raw data were the $d_{ij}^*$
values.⁷

In the previous subsection, the $d_{ij}^*$ were discussed as if all
were the same type of work. In actual fact, Phase II links could be
among three different types of work: intra-service work, total work,
and work per unit time (intensity). This resulted in four types of
links used in Phase II:

•  Links  from  intra-service  work  to  intra-service  work, 

•  Links  from  total  work  to  total  work, 

•  Links  from  total  work  to  intra-service  work,  and 

•  Links  from  intensity  to  intensity. 

For 1,126 surveyed services, data were available for intra-service
work, total work, and intra-service time physician-level means, and
associated standard errors and numbers of physicians surveyed for
each type of work. One specialty, ophthalmology in Phase I, did not
have estimated standard errors; we instead used the average standard
error for all ophthalmological services in Phase II as a surrogate.
After some exploration, we discovered that the Harvard study variance
estimates were multiplied by the number of physicians surveyed for
each service. We did likewise, although this multiplication gives
less weight in the regression to those services that were more widely
surveyed. We will refer to these weighted variances as the Harvard
variances throughout the rest of this section.
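
The effect of multiplying the variance estimate by the number of physicians can be seen in a short sketch. It assumes that the variance estimate being multiplied is the squared standard error of the mean; the function names and the example spread of 0.2 are hypothetical.

```python
import math


def harvard_weight(std_err, n_physicians):
    """Regression weight: inverse of (n * squared standard error)."""
    harvard_variance = n_physicians * std_err ** 2
    return 1.0 / harvard_variance


def plain_weight(std_err):
    """Weight if the squared standard error were used without the factor n."""
    return 1.0 / std_err ** 2


# Two services with the same physician-to-physician spread (0.2 on the
# log scale) but different numbers of respondents.
spread = 0.2
se_small = spread / math.sqrt(50)   # 50 physicians surveyed
se_large = spread / math.sqrt(200)  # 200 physicians surveyed

w_small = harvard_weight(se_small, 50)
w_large = harvard_weight(se_large, 200)
```

Under this assumption both services receive the same weight (25), whereas weights based on the squared standard error alone would favor the more widely surveyed service by a factor of four. That is the sense in which the multiplication gives less weight to widely surveyed services.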

We wanted to compare our estimated specialty shift parameters $b_j$
with those reported by Harvard. Thus, we needed to first center the
work logarithms $d_{ij}^*$ to have mean zero as in Equation (1). In
general, we subtracted the intra-service work mean within specialty
from both the intra-service work values and the total work values. For
three specialties,⁸ we used the total work specialty mean, as
intra-service work values were not reported. For intensity links, we
divided the centered intra-service work value by the intra-service
time reported.

⁷Recall that these $d_{ij}$ values are logarithms. Throughout this
subsection, all calculations will be based on the logarithm of work,
not on work itself. To help the text flow more smoothly, we will omit
this specification most of the time.

⁸Nuclear medicine, radiation oncology, and pathology/Phase II.

Given these centered work values, we first fit the model using
weighted least squares, with the weights equal to the inverse of the
Harvard variances. Then we took three Tukey steps, iterating new
weight values at each step. Table 5 compares the Harvard reported
value from the Phase II final report and our result, using the same
degree of accuracy as Harvard reported. The final column of the table
expresses the differences between the two linkage calculations in
terms of the percentage change in specialty work values. That is, if
$\Delta_j$ is the difference on the logarithmic scale between our
result and the Harvard result, then $100[10^{\Delta_j} - 1]$ expresses
this difference in terms of a percentage increase or decrease in the
work values for a specialty $j$. This value, which we call "percent
change," is given in the last column of Table 5. This percentage
difference is a measure of the change that would result in physician
work (and hence payment) in adopting an alternative linkage procedure.
For example, in Table 5, the $\Delta_j$ for plastic surgery is
0.02405. This translates to a percentage change of 5.69, which means
that if our calculations were adopted instead of those of the Harvard
group, the work value of all services measured and extrapolated from
the plastic surgery survey would be increased by 5.69 percent.
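
The conversion from a log-scale difference to the "percent change" column is a one-line calculation, sketched below. The function name is illustrative; the value 0.02405 is the log-scale difference that reproduces the 5.69 percent reported for plastic surgery.

```python
def percent_change(delta_j):
    """Percent change in work values implied by a log10-scale shift."""
    return 100.0 * (10.0 ** delta_j - 1.0)


plastic_surgery = percent_change(0.02405)  # close to 5.69
no_change = percent_change(0.0)            # identical results give 0
```

Because the scale is logarithmic, equal positive and negative shifts are not symmetric in percentage terms: a shift of +0.1 is an increase of about 26 percent, whereas a shift of -0.1 is a decrease of about 21 percent.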

In  general,  our  results  are  within  10  percent  of  the  Harvard  results 
except  for  the  three  specialties,  dermatology/II,  ophthalmology/I,  and 
orthopedic  surgery/II.  These  differences  could  be  due  to  the  problems 
we  had  duplicating  the  Harvard  centering  procedure  and  the  lack  of 
standard  errors  for  ophthalmology/I. 

Table 5

Comparison of Harvard Phase II Linkage and
Our Replication Attempt

                                                            Percent
Specialty               Orig. $b_j$ RAND $b_j$ $\Delta_j$    Change

Allergy/immunology        1.8404    1.8590    0.0186      4.36
Anesthesiology            2.1921    2.2001    0.0080      1.85
Cardiology                2.1280    2.1213   -0.0068     -1.54
Dermatology/I             1.6385    1.6463    0.0078      1.80
Dermatology/II            1.9775    1.8737   -0.1038    -21.27
Emergency medicine        1.9367    1.9405    0.0038      0.87
Family practice           1.7794    1.7873    0.0078      1.82
Gastroenterology          2.1589    2.1860    0.0270      6.43
General surgery/I         2.2061    2.2309    0.0248      5.86
General surgery/II        2.4432    2.4603    0.0170      4.00
Hematology/oncology       1.9284    1.9163   -0.0122     -2.76
Infectious diseases       1.9255    1.9436    0.0180      4.24
Internal medicine/I       1.7579    1.7633    0.0054      1.24
Internal medicine/II      1.7988    1.8054    0.0066      1.52
Maxillofacial surgery     2.2449    2.2537    0.0088      2.04
Nephrology                2.0532    2.0456   -0.0076     -1.75
Neurology                 1.8253    1.8005   -0.0248     -5.56
Neurosurgery              2.7556    2.7373   -0.0184     -4.14
Nuclear medicine          1.8806    1.8461   -0.0346     -7.65
Obstetrics/gynecology     2.0722    2.0855    0.0132      3.10
Ophthalmology/I           2.1181    2.1433    0.0252      5.96
Ophthalmology/II          2.1014    2.1748    0.0734     18.40
Orthopedic surgery/I      2.0714    2.0543   -0.0172     -3.87
Orthopedic surgery/II     2.4918    2.3484   -0.1434    -28.13
Osteopathy                1.8000    1.7641   -0.0360     -7.94
Otolaryngology            2.2746    2.2932    0.0186      4.36
Pathology/I               1.6191    1.6460    0.0268      6.38
Pathology/II              1.7571    1.7124   -0.0448     -9.79
Pediatrics                1.6741    1.6749    0.0008      0.17
Physical and rehab.       1.8874    1.9246    0.0372      8.93
Plastic surgery           2.3746    2.3987    0.0240      5.69
Pulmonary medicine        1.8895    1.8913    0.0018      0.40
Psychiatry/I              2.1026    2.1205    0.0178      4.20
Psychiatry/II             2.1094    2.0736   -0.0358     -7.92
Radiology                 1.6811    1.7172    0.0360      8.66
Rheumatology              1.7105    1.7399    0.0294      6.99
Radiation oncology        2.0365    2.0227   -0.0138     -3.14
Thoracic surgery          2.4752    2.4986    0.0234      5.52
Urology                   2.2467    2.2640    0.0173      4.05

NOTE: A roman numeral following a specialty refers to the phase of
the Harvard study.

A NEW LOOK AT LINKAGE

Our examination of the Harvard linkage procedure revealed certain
troublesome choices and simplifying assumptions. We therefore consider
an alternative to their methodology, partly to determine how
sensitive the results were to the linkage approach. In particular, our
proposed alternative takes into account the following philosophical
tenets that we consider important:

•  If the linked services are stated to have equivalent amounts of
work, then the RVW scale should reflect this equivalence, and the
$a_i$ values calculated in the linkage procedure should be the common
scale values. In the current Harvard method, the $a_i$ are ignored, so
linked services, though they are claimed to be equivalent, can and
do have different work values.⁹

•  Again, if equivalence means equivalence, then links should be
transitive. Thus, if service A entails the same work as service B
and service B entails the same work as service C, then services A
and C should also be equivalent and linked. The Harvard group
adopted a fuzzier definition of equivalence and did not assume
transitivity.

•  The Medicare Fee Schedule mandates a single RVW for each CPT
code. This is an implicit statement that the same CPT code
represents, on average, the same work across specialties and
constitutes an implicit link. These links should also be used to form
the common scale.¹⁰

•  Finally, if the $a_i$ values resulting from the linkage
calculations are taken to be work values on the common scale, the
surveyed work values not linked should maintain as close a
relationship as possible to the new $a_i$ values as they did to the
originally surveyed $d_{ij}$ values.

These principles, taken together, result in what we call the
perturbation minimization procedure to express work on a common scale.
The primary difference between this new methodology and the Harvard
approach is that the former incorporates the equivalence of services
fully and directly into the optimization algorithm. Equivalence is
incorporated fully by adding the link transitivity requirement and
same CPT code links. Equivalence is incorporated directly by the fact
that our procedure yields the final RVWs, thereby eliminating the
additional averaging step needed by the Harvard group after their
optimization. By eliminating the averaging, however, we cannot
maintain the surveyed relationships between service work values within
a specialty. In effect, we posit that the inter-specialty
equivalencies defined by links are free of measurement error, and
therefore the surveyed intra-specialty relationships must be adjusted
relative to the links.¹¹ Our new procedure consists of two discrete
steps: generating a different set of links and determining the optimal
work values via a new linkage procedure. After describing this new
methodology and its results, we discuss their implications.

⁹As we explained in Section 2, this necessitates the additional step
of deriving a common work value for different vignettes carrying a
common CPT code, both within a specialty and across specialties.

¹⁰A problem arises because multiple vignettes within a specialty can
have the same CPT code. For example, a large number of "50 minute
hour" office visits in psychiatry are coded 90844, even though it is
obvious that different types of visits involve different amounts of
work. But the intent of the RBRVS is not to provide a work value for
all physician services but to provide a valid average work value for
all physician services billed under a particular CPT code. Thus,
within a specialty, an average of surveyed values for common CPT codes
is an estimate of work, which can be linked to similar averages of
other specialties.

Generate  a  Different  Set  of  Links 

The first step is to define the set of services to be used in the
links. We began with our link set equal to the Harvard link set. We
then changed this link set in four ways.

1. Drop Intensity Links. The intensity links seem contradictory
given that the Harvard approach justified magnitude estimation because
work was a varying and not necessarily linear function of time,
intensity, technical ability, and mental judgment. Thus, we deleted
these intensity links from our link set, leaving all Harvard
intra-service, total, and mixed links. This resulted in the loss of 32
of the 275 original Harvard links.

2. Generate Common CPT Code Links. We expanded our link set
so that all same CPT code surveyed services were linked; thus, they
were linked across specialties. In several cases, a CPT code was
surveyed more than once per specialty, albeit with different
vignettes. We did not link these same CPT code services within a
specialty. Instead, we first formed a new "service" whose work value
was the weighted (by number of survey respondents) average of the work
values of services within that specialty with the same CPT code. The
standard errors were calculated in the usual manner for a weighted
mean. This averaging produced 83 new services, bringing the total
number of services to 1,209.

Any original Harvard links between two services with the same CPT
code were left intact. However, we did not distinguish among linked
and unlinked services in creating the common-CPT artificial services.
Thus, two vignettes sharing a common CPT code but from different
specialties might be used twice in the linkage procedure: once if they
were paired as an original Harvard vignette link and once as part of
their contributions to a multi-vignette common-CPT link.

¹¹We do not defend this assumption as true in an absolute sense but
offer it as part of a set of assumptions with as much claim to
validity as the set of assumptions adopted by the Harvard group.
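
The respondent-weighted averaging used to form a common-CPT "service" within a specialty can be sketched as follows. The text does not spell out "the usual manner" for the standard error, so the formula here assumes that the vignette means are independent; the function name and the sample numbers are likewise hypothetical.

```python
import math


def pooled_service(values, std_errs, n_respondents):
    """Respondent-weighted mean of same-CPT vignettes and its standard error.

    values        : mean log work of each vignette
    std_errs      : standard error of each vignette mean
    n_respondents : number of survey respondents for each vignette
    Assumes independent vignette means (an assumption, not from the text).
    """
    total = sum(n_respondents)
    mean = sum(n * v for n, v in zip(n_respondents, values)) / total
    variance = sum((n * s) ** 2 for n, s in zip(n_respondents, std_errs))
    return mean, math.sqrt(variance) / total


# Two hypothetical vignettes sharing one CPT code within a specialty.
mean, std_err = pooled_service([2.0, 3.0], [0.04, 0.04], [100, 300])
```

With three times as many respondents, the second vignette pulls the pooled value to 2.75 rather than the unweighted midpoint of 2.5.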

3. Make Linkages Transitive. The Harvard links were not
necessarily transitive, a property that we believed essential to
ensuring the fairness of the procedure and its acceptability.
Therefore, we expanded links to create transitive link subsets and
defined the members of each associated orbit to consist of the
interlinked services, using a term borrowed from the modern algebra
literature.¹²

An orbit is an inclusive set of services that are linked together
transitively. For example, if service A is linked to service B, and
service B is linked to service C, then service A must be linked to
service C to ensure transitivity. If services A, B, and C are linked
thus and are not linked to any other services, then (A, B, C) forms an
orbit with three associated links between its member services. If any
of the services are linked to other services, new links are added to
ensure transitivity and the orbit becomes larger. For example, if A is
also linked to D, then links from B to D and from C to D are added and
the orbit consists of (A, B, C, D) with six associated links.

Each orbit has an associated $a_o$, which we call an orbit location
parameter. Implicitly, we changed the Harvard optimization constraint
in Equation (2) so that all members of an orbit $o$ must have the
orbit location parameter $a_o$.
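
Forming orbits amounts to finding the connected groups of services in the link set. The sketch below uses a simple union-find structure for this; the function names are illustrative, and the approach is one convenient way to compute the transitive closure rather than a description of how the original calculation was programmed.

```python
from collections import defaultdict


def find_orbits(links):
    """Group services into orbits (transitively linked sets).

    links is an iterable of (service, service) pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    orbits = defaultdict(set)
    for service in list(parent):
        orbits[find(service)].add(service)
    return list(orbits.values())


def transitive_link_count(orbit):
    """Number of pairwise links once an orbit is made fully transitive."""
    n = len(orbit)
    return n * (n - 1) // 2


example_orbits = find_orbits([("A", "B"), ("B", "C"), ("A", "D"), ("E", "F")])
link_counts = sorted(transitive_link_count(o) for o in example_orbits)
```

For the example in the text, (A, B, C, D) forms one orbit with six links. Because the link count grows with the square of orbit size, a handful of widely shared codes can account for thousands of links.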

4. Drop EM Service Links. After averaging, forming the new same
CPT code links, and making all links transitive, we had 172 orbits
made up of 13,102 links. The main reason for this large number of
links was that several CPT codes appear in almost every specialty.
Given our same CPT code link approach, these specialties all become
linked, producing several large orbits. If a Harvard vignette link
happens to fall into the same orbit, the orbit must expand to include
that service and all services linked to it to satisfy the transitivity
requirement. An orbit can encompass many services through this
expansion process. In particular, the three largest orbits had
approximately 8,000, 3,000, and 1,200 links.

The CPT codes that produce these large orbits are EM codes (CPTs
90000 through 90699), which tend to appear in almost all specialties.
Since the influence of a particular code on the optimization results
increases monotonically with the number of links the service appears
in, these EM codes have a large influence and tend to swamp the effect
of other codes in the linkage process. General dissatisfaction with
the current use of CPT codes for EM services has been expressed,¹³ and
their surveyed work values are considered to be uncertain. Given this
uncertainty, we decided to drop EM codes from consideration for
common-CPT linkage so that they would not be allowed to have a major
effect on RBRVS calculations. If an EM code was linked in the Harvard
set, we retained this vignette link in our set. The resulting new
number of orbits was 208, with 638 links.

¹²Gilbert (1976).

Table 6 shows the vignette and common-CPT code links by specialty.
Each link appears twice in this table, since it is composed of two
services in different specialties. Thus, the total number of entries
in this table is 1,276. The number of specialties increases from 39 to
42 because of the addition of three subspecialties,
ophthalmology/corneal procedures, ophthalmology/glaucoma procedures,
and child psychiatry, which were separately surveyed but did not
appear in the vignette link set. Because these services have some
common-CPT links, they now appear in the linkage procedure. Table 6
shows that some specialties, for example ophthalmology/II, have large
increases in the number of links.

Before describing the new linkage procedure, evaluation of the effect
of the new expanded, transitive link set alone on the results is
warranted. Table 7 shows the specialty shift parameters that result
from the Harvard linkage procedure using the new link set. We compare
these results to our own replication of the Harvard results rather
than the original Harvard results. This comparison is made because
our two sets of results are based on the same data assumptions and
thus provide a fairer comparison of the consequences of the new link
set.

The largest change is in anesthesiology, which loses almost 75
percent. Dermatology/I, ophthalmology^, orthopedic surgery/II, and
pediatrics each gain over 25 percent whereas emergency medicine and
physical and rehabilitative services suffer decreases.

A  New  Linkage  Procedure 

In the Harvard linkage procedure, after the specialty shift
parameters $b_j$ and the orbit location parameters $a_o$ are estimated
by the


^^Lasker,  Marquis,  and  Morrow  (1991);  Physician  Payment  Review  Commission 
(1991).  This  dissatisfaction  led  to  the  replacement  of  the  EM  codes  by  a  new  set 
(numbered 99200-99499) in the 1992 version of CPT.

Table 6
Number of Vignette and Common-CPT Code Links

Specialty                 Vignette  Common CPT  Total
Allergy/immunology               6           7     13
Anesthesiology                   5          10     15
Cardiology                       7          14     21
Dermatology/I                    5          12     17
Dermatology/II                   9          26     35
Emergency medicine               6          21     27
Family practice                 29          21     50
Gastroenterology                12          15     27
General surgery/I               32          36     68
General surgery/II              21          16     37
Hematology/oncology             12          11     23
Infectious diseases              5          19     24
Internal medicine/I             32          37     69
Internal medicine/II            39          39     69
Maxillofacial surgery            4           4      8
Nephrology                      11           8     19
Neurology                       12          13     25
Neurosurgery                    11          15     27
Nuclear medicine                 4           2      6
Obstetrics/gynecology           11          13     24
Ophthalmology/I                 11          21     32
Ophthalmology/II                17          70     87
Ophthalmology/cornea                        44     44
Ophthalmology/glaucoma                      39     39
Orthopedic surgery/I            17          19     36
Orthopedic surgery/II           19          13     32
Osteopathy                      11           9     20
Otolaryngology                  12          12     24
Pathology/I                      4          20     24
Pathology/II                     4          26     30
Child psychiatry                            13     13
Pediatrics                       8          18     26
Physical and rehab.              9          22     31
Plastic surgery                 17          13     30
Pulmonary medicine              15          33     48
Psychiatry/I                    10          15     25
Psychiatry/II                    7          15     22
Radiology                        8           6     14
Rheumatology                    15          16     31
Radiation oncology               6           5     11
Thoracic surgery                10          20     30
Urology                         13          11     24

NOTE: A roman numeral following a specialty refers to the
phase of the Harvard study. In Phase II, separate surveys were
conducted for general ophthalmology, procedures involving the
cornea, and procedures related to glaucoma.

Table 7
Comparison of the New Link Set Results to the Harvard Link Set Results

Specialty                   New Link Set b_j  RAND b_j       Δ_j  Percent Change
Allergy/immunology                    1.9468    1.8590    0.0878           22.40
Anesthesiology                        1.6090    2.2001   -0.5911          -74.36
Cardiology                            2.1035    2.1212   -0.0178           -4.01
Dermatology/I                         1.7718    1.6463    0.1255           33.51
Dermatology/II                        1.8882    1.8737    0.0145            3.41
Emergency medicine                    1.8537    1.9405   -0.0868          -18.11
Family practice                       1.7603    1.7872   -0.0270           -6.02
Gastroenterology                      2.1766    2.1860   -0.0094           -2.14
General surgery/I                     2.2253    2.2309   -0.0055           -1.26
General surgery/II                    2.4464    2.4603   -0.0139           -3.15
Hematology/oncology                   1.9180    1.9163    0.0018            0.41
Infectious diseases                   1.8883    1.9435   -0.0552          -11.94
Internal medicine/I                   1.7438    1.7633   -0.0195           -4.39
Internal medicine/II                  1.7833    1.8054   -0.0221           -4.96
Maxillofacial surgery                 2.2715    2.2537    0.0178            4.19
Nephrology                            2.0381    2.0456   -0.0075           -1.71
Neurology                             1.7910    1.8005   -0.0095           -2.16
Neurosurgery                          2.7877    2.7372    0.0504           12.31
Nuclear medicine                      1.8610    1.8461    0.0150            3.50
Obstetrics/gynecology                 2.0358    2.0855   -0.0497          -10.82
Ophthalmology/I                       2.2457    2.1432    0.1024           26.60
Ophthalmology/II                      2.1416    2.1748   -0.0332           -7.36
Ophthalmology/cornea                  2.1768
Ophthalmology/glaucoma                2.0546
Orthopedic surgery/I                  2.1231    2.0542    0.0689           17.19
Orthopedic surgery/II                 2.4611    2.3483    0.1128           29.65
Osteopathy                            1.7701    1.7640    0.0061            1.40
Otolaryngology                        2.2493    2.2931   -0.0438           -9.60
Pathology/I                           1.7207    1.6459    0.0748           18.80
Pathology/II                          1.7009    1.7124   -0.0114           -2.59
Child psychiatry                      2.0906
Pediatrics                            1.7734    1.6748    0.0986           25.48
Physical and rehab.                   1.8512    1.9246   -0.0733          -15.53
Plastic surgery                       2.3496    2.3987   -0.0491          -10.68
Pulmonary medicine                    1.8773    1.8913   -0.0140           -3.17
Psychiatry/I                          2.1653    2.1205    0.0449           10.88
Psychiatry/II                         2.0471    2.0736   -0.0265           -5.91
Radiology                             1.7943    1.7171    0.0772           19.44
Rheumatology                          1.7831    1.7398    0.0433           10.49
Radiation oncology                    2.0393    2.0226    0.0167            3.91
Thoracic surgery                      2.5000    2.4985    0.0015            0.34
Urology                               2.2223    2.2639   -0.0416           -9.14

NOTE: A roman numeral following a specialty refers to the phase of the
Harvard study. In Phase II, separate surveys were conducted for general
ophthalmology, procedures involving the cornea, and procedures related
to glaucoma.

least-squares procedure, the location of an unlinked service whose
centered specialty-specific work value was $d_{ij}$ is just
$d_{ij} + b_j$. This approach maintains the distance between services
within a specialty, that is, the intra-specialty service
relationships; the entire specialty-specific scale is simply shifted
as a body. However, services linked across specialties may not have
the same work value after linkage because the $a_o$ are not used to
assign linked service work values on the common scale.
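
The rigid shift described above can be sketched in a few lines. The
shift parameters below are taken from the "New Link Set b_j" column of
Table 7, but the centered work values and service names are
hypothetical, chosen only to show that within-specialty distances
survive the shift while a service linked across two specialties can
land at two different common-scale values.

```python
def shift_to_common_scale(centered_values, b_j):
    """Harvard-style rigid shift: add the specialty parameter b_j to
    every centered within-specialty log work value d_ij."""
    return {service: d_ij + b_j for service, d_ij in centered_values.items()}


# Shift parameters from the "New Link Set b_j" column of Table 7.
b_cardiology = 2.1035
b_internal_med_1 = 1.7438

# Hypothetical centered values; "linked_visit" stands for a service
# that is linked across the two specialties.
cardiology = {"linked_visit": -0.30, "cath": 0.40}
internal_med = {"linked_visit": 0.10, "physical": 0.25}

card_common = shift_to_common_scale(cardiology, b_cardiology)
im_common = shift_to_common_scale(internal_med, b_internal_med_1)

# Within-specialty distances are unchanged by the shift ...
within_before = cardiology["cath"] - cardiology["linked_visit"]
within_after = card_common["cath"] - card_common["linked_visit"]

# ... but the linked service has two different common-scale values.
linked_gap = card_common["linked_visit"] - im_common["linked_visit"]
print(within_before, within_after, linked_gap)
```

A nonzero gap of this kind is what the Harvard averaging step, and the
alternative procedure described next, are meant to remove.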

Our proposed alternative requires that all linked services within an
orbit have the same work value on the common scale. That is, members
of an orbit $o$ all have work value $a_o$. At the same time, we would
like the optimization to ensure that the distances between services
within a specialty stay as close as possible to the original surveyed
distances. In essence, after redefining the linked service values to
be $a_o$, we seek to adjust the unlinked $d_{ij}$ values to preserve
as much as possible their relationships to other services within
their specialty. The name of our new procedure, perturbation
minimization, describes this goal.

For services $h$ and $i$ within specialty $j$, let the surveyed
distance be^^

$$\delta_{hi,j} = d_{ij} - d_{hj}.$$

Let the location of an orbit $o$ on the common scale be $a_o$ and the
locations of services $h$ and $i$ be $a_h$ and $a_i$. Then the new
optimization problem is

$$\min_{a_o,\,a_h,\,a_i} \; \sum_{j} \sum_{h} \sum_{i>h}
  \frac{(a_i - a_h - \delta_{hi,j})^2}{s_{hi,j}^2}. \tag{3}$$

The first summation is over all specialties $j$, and the second and
third are over all pairs of services $(h,i)$ within a particular
specialty, with no double counting allowed. The objective function is
minimized under the constraint that all member services of orbit $o$
have a fixed value $a_o$. Each term is weighted by the inverse of its
estimated variance $s_{hi,j}^2$, which is calculated from the
surveyed variances of the $d_{ij}$. We do not use the Tukey biweight
method.

The key difference between the optimizations in Equation (2) and
Equation (3) is that all services are involved in the new
optimization, Equation (3), and services can move within specialties.
We look at differences not only between the standard service and
every other service in a specialty but also between all pairs of
services within a specialty. We force linked services to have equal
values for all members of an orbit and then try to minimize the
effect on the relationships between services within specialties. We
seek to minimize the perturbations within specialty scales that
result from the required equality of all services within an orbit.


^^If service $k$ is the standard service for the specialty, i.e.,
$h = k$, then $\delta_{ki,j} = d_{ij}$. This convenience facilitates
calculating other differences through the shortcut of
$\delta_{hi,j} =$

Though Equation (3) remains a least-squares problem, the number of
parameters is large and the optimization problem could prove
difficult to solve because of its size. A reasonable alternative^^ is
a two-stage optimization. The first stage is to do the Harvard
optimization of Equation (2) with the new link set and orbit
constraints, as was done for Table 7. After this stage, the work
values of all members of the same orbit are set equal to their
associated $a_o$. Then, for each specialty $j$, we attempt to
minimize the effect on the relationships between services within the
specialty by solving the inner optimization from Equation (3):

$$\min_{a_h,\,a_i} \; \sum_{h} \sum_{i>h}
  \frac{(a_i - a_h - \delta_{hi,j})^2}{s_{hi,j}^2}$$

with the constraint that all linked services have $a_i = a_o$ for the
appropriate orbit location parameter estimated in stage one. This
alternative is philosophically equivalent to the full optimization in
that it forces linked services to have the same assigned work value
and it seeks to assign work values that minimize the distortion of
surveyed distances between services within specialties that results
from linkage.
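
The second stage can be sketched for a single specialty as a weighted
linear least-squares problem. This is a minimal illustration, not the
implementation used for the reported results. The function name, the
input layout, and the example numbers are hypothetical. The variance
of each surveyed distance is taken to be the sum of the two surveyed
variances, which assumes independent survey errors; the text says only
that it is calculated from the surveyed variances. At least one
service must be linked, otherwise the scale has no anchor.

```python
from itertools import combinations

import numpy as np


def stage_two(d, var, fixed):
    """Second-stage perturbation minimization for a single specialty.

    d     : surveyed centered log work values d_ij, one per service
    var   : surveyed variance of each d_ij
    fixed : {service index: orbit location a_o from stage one}
    Returns the common-scale locations a_i for every service.
    """
    n = len(d)
    free = [i for i in range(n) if i not in fixed]
    column = {i: c for c, i in enumerate(free)}
    a = np.array([fixed.get(i, 0.0) for i in range(n)], dtype=float)
    if not free:
        return a

    rows, targets = [], []
    for h, i in combinations(range(n), 2):  # each pair counted once
        if h in fixed and i in fixed:
            continue  # constant term; does not affect the minimizer
        # Assumption: independent errors, so Var(d_i - d_h) is the sum.
        weight = 1.0 / np.sqrt(var[h] + var[i])
        row = np.zeros(len(free))
        target = d[i] - d[h]  # surveyed distance delta_hi
        # Residual is a_i - a_h - delta_hi; fixed values move to the
        # right-hand side.
        if i in fixed:
            target -= fixed[i]
        else:
            row[column[i]] += 1.0
        if h in fixed:
            target += fixed[h]
        else:
            row[column[h]] -= 1.0
        rows.append(weight * row)
        targets.append(weight * target)

    solution, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    a[free] = solution
    return a


# Hypothetical specialty: services 0 and 3 are linked, 1 and 2 are not.
d = [0.0, 0.3, 0.5, 0.9]
var = [0.01, 0.01, 0.01, 0.01]
a = stage_two(d, var, fixed={0: 2.0, 3: 3.1})
print(np.round(a, 3))
```

In this example the two orbit values are 1.1 apart, whereas the
surveyed distance between the linked services is 0.9. The two unlinked
services absorb the discrepancy evenly: each moves 0.1 beyond a rigid
shift of 2.0, and their mutual distance of 0.2 is preserved.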

Results  of  the  Perturbation  Minimization  Procedure 

The new perturbation minimization procedure does not require any
calculations after it has been completed. In contrast, the Harvard
procedure requires an averaging step afterward to ensure that linked
services have the same work value and that each CPT code has a single
work value. Thus, the results of our new procedure must be compared
to Harvard's final reported results. Because of the uncertainty
surrounding the EM CPT codes and our subsequent decision not to use
them to generate new links, we do not include them in our
comparisons.


^^We thank Grace Carter for her considerable assistance in developing
this alternative.

The Harvard values used for comparison are the Phase III results
reported in February 1991 to HCFA for surveyed non-EM services. These
work values and our new values were standardized to have the same
overall mean on the logarithmic scale before comparison, just as the
different phase results were standardized by the Harvard group. Since
the perturbation minimization procedure may shift work values within
a specialty by varying amounts, comparison must be made between our
results and Harvard's for each service individually. In contrast,
earlier comparisons in Tables 5 and 7 were made by specialty, as the
specialty scales shift rigidly, with all services within a specialty
moving a fixed amount.

The  differences  on  the  logarithmic  scale  were  assessed  graphically  via 
a  histogram  for  each  specialty.  In  general,  these  histograms  were 
unimodal  and  symmetric.  The  comparisons  are  summarized  in  Table 
8  and  in  Figure  1. 

Table 8 shows, by specialty, the average percentage difference
between the work value calculated by the alternative linkage
procedure and that calculated by the Harvard linkage procedure. Seven
hundred services were compared across the 42 specialties to calculate
these means. These 700 services corresponded to 522 CPT codes, as
some codes appeared in more than one specialty. Of these 522, Harvard
did not publish work values for 46. In Table 8, the mean of the
differences between our work values and the Harvard work values on
the logarithmic scale is given for each specialty in the column
labeled $\Delta_j$. The middle column shows the standard deviation of
the differences on the logarithmic scale. The final column shows the
percentage change in the work values on the nonlogarithmic scale that
corresponds to the mean difference.
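
The three columns of Table 8 can be computed as in the sketch below.
The data are hypothetical and the function names are ours. The sketch
assumes base-10 logarithms (inferred from the tabulated values) and
uses the sample standard deviation; the text does not say which form
of the standard deviation was used.

```python
import statistics


def standardize(log_values, target_mean):
    """Shift a full set of log work values so that its overall mean
    equals target_mean."""
    offset = target_mean - statistics.fmean(log_values.values())
    return {key: value + offset for key, value in log_values.items()}


def specialty_summary(ours, harvard, specialty):
    """Mean and SD of the log differences for one specialty, plus the
    percent change implied by the mean difference."""
    diffs = [ours[key] - harvard[key]
             for key in ours if key[0] == specialty and key in harvard]
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # assumption: sample SD
    percent = 100.0 * (10.0 ** mean_diff - 1.0)  # assumes base-10 logs
    return mean_diff, sd_diff, percent


# Hypothetical log10 work values keyed by (specialty, service).
harvard = {("A", "s1"): 1.90, ("A", "s2"): 2.10,
           ("B", "s1"): 2.30, ("B", "s2"): 2.50}
ours_raw = {("A", "s1"): 1.00, ("A", "s2"): 1.16,
            ("B", "s1"): 1.24, ("B", "s2"): 1.48}

# Standardize over all services, then summarize specialty by specialty.
ours = standardize(ours_raw, statistics.fmean(harvard.values()))
for specialty in ("A", "B"):
    mean_diff, sd_diff, percent = specialty_summary(ours, harvard, specialty)
    print(f"{specialty}  {mean_diff:7.3f} {sd_diff:6.3f} {percent:7.2f}")
```

Because the two sets are standardized over all services rather than
within each specialty, the per-specialty means can differ from zero
even though the overall means agree.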

Figure 1 shows a histogram of the individual percentage differences
for the 476 surveyed CPT codes for which Harvard published work
values. Although most of the RAND RVWs are within 15 percent of the
original values, one-fifth of the CPT codes have larger
discrepancies.

In general, though, the percentage changes are smaller than those
shown in Table 7, which indicates that the second stage of the
perturbation minimization optimization (Equation (3)) has an effect
similar to that of the Harvard averaging procedure. This similarity
makes sense, since the new procedure sought to take into account the
averaging goals automatically by including same-CPT-code links and by
forcing members of an orbit to have the same work value. As before,
the specialty with the largest change is anesthesiology, with a loss
of over 70 percent. Neurosurgery, child psychiatry, psychiatry/I,
radiation oncology, and thoracic surgery all have gains of over 15
percent. When considering these averages, the standard deviation of
the differences must also be taken into account. A large standard
deviation indicates that the percentage changes for services within a
specialty varied widely.

Table 8
Comparison of Harvard Phase III Final Work Values and Our
Perturbation Minimization Results

Specialty                      Δ_j     S.D.  Percent Change
Allergy/immunology           0.054    0.078           13.16
Anesthesiology              -0.536    0.206          -70.88
Cardiology                   0.002    0.066            0.37
Dermatology/I                0.050    0.069           12.31
Dermatology/II               0.037    0.108            8.97
Emergency medicine          -0.017    0.103           -3.76
Family practice              0.004    0.083            1.01
Gastroenterology             0.008    0.081            1.88
General surgery/I            0.006    0.258            1.50
General surgery/II           0.024    0.022            5.79
Hematology/oncology          0.014    0.040            3.41
Infectious diseases          0.026    0.046            6.41
Internal medicine/I         -0.009    0.086           -2.09
Internal medicine/II         0.048    0.074           11.62
Maxillofacial surgery        0.024    0.179            5.76
Nephrology                   0.010    0.011            2.41
Neurology                   -0.001    0.011           -0.13
Neurosurgery                 0.070    0.063           17.41
Nuclear medicine            -0.013    0.014           -2.89
Obstetrics/gynecology       -0.007    0.109           -1.64
Ophthalmology/I              0.045    0.183           10.96
Ophthalmology/II            -0.018    0.160           -4.12
Ophthalmology/cornea        -0.023    0.175           -5.07
Ophthalmology/glaucoma      -0.027    0.103           -6.09
Orthopedic surgery/I         0.063    0.075           15.51
Orthopedic surgery/II        0.049    0.090           11.83
Osteopathy                  -0.019    0.073           -4.38
Otolaryngology               0.024    0.060            5.62
Pathology/I                 -0.039    0.046           -8.52
Pathology/II                -0.020    0.116           -4.50
Child psychiatry             0.114    0.256           30.12
Pediatrics                  -0.019    0.092           -4.28
Physical and rehab.         -0.004    0.042           -0.96
Plastic surgery              0.030    0.057            7.15
Pulmonary medicine           0.041    0.036            9.99
Psychiatry/I                 0.081    0.274           20.58
Psychiatry/II               -0.010    0.196           -2.22
Radiology                   -0.021    0.029           -4.66
Rheumatology                 0.048    0.059           11.76
Radiation oncology           0.073    0.214           18.30
Thoracic surgery             0.090    0.037           23.08
Urology                      0.063    0.056           15.61

NOTE: A roman numeral following a specialty refers to the
phase of the Harvard study. In Phase II, separate surveys were
conducted for general ophthalmology, procedures involving the
cornea, and procedures related to glaucoma.

Figure 1. Histogram of Percentage Differences in Work Values,
RAND vs. Harvard Linkage Procedure


Implications  of  the  Alternative  Methodology 

The conclusion from our alternative linkage exercise is that the work
values change considerably depending on the method used, as evidenced
in Table 8. This means that the results of a linkage procedure are
sensitive to the assumptions underlying that procedure. Although we
personally prefer our set of assumptions to those of the Harvard
group, we claim only that the two sets merit equal consideration. We
do not claim that our results are better than Harvard's, only that
they are different.

Our conclusion is just a beginning. Without further sensitivity
analysis to investigate the behavior of the results, for example with
the deletion or addition of links, the validity of the Harvard
results is unclear. Given the considerable public response to the
RBRVS, the future use of this or any other linkage procedure without
much more research is not advised.

One possible criticism of our method is that a single parameter that
embodies a particular specialty's shift to the common scale is not
available. The Harvard method produced single values, the $b_j$
discussed above. Our method may shift services within a specialty by
different amounts, so no such sole parameter is available. The
Harvard group used these specialty shift parameters to calculate the
work values for services that were surveyed after the linkage was
performed. The assumption in using a specialty's shift parameter to
shift future surveyed work values onto the common scale is that these
surveyed values are measured in the same units as the original ones.
However, this assumption is violated in this study by the fact that
the Harvard shift parameters are different for specialties that were
surveyed in both Phase I and Phase II, such as dermatology (Table 5).
Thus, this criticism of our alternative linkage procedure is not
supported.

A key question in hindsight is: Why link at all? The need for linkage
arose because physicians ranked their specialty's services on
different scales, which then had to be calibrated onto a common
scale. In retrospect, that may not have been the most desirable
strategy; establishing instead a common inter-service scale for all
specialty surveys might have been preferable.

In any event, the issue is moot with respect to the present set of
values. For the future, however, the lesson learned from our
investigation is that any linkage procedure is sensitive to its
assumptions and better avoided if possible. Because the published
RVWs provide a common scale for any future measurements, we recommend
that future surveys of physician work be based on that common scale,
thereby eliminating any need for linkage. Even if some of the RVWs in
the published set are questionable, the set is large enough and dense
enough that some of the values are consensually regarded as accurate
and can form a reference value foundation for future measurement
sessions.


5.  RECOMMENDATIONS 


The work reported here has concentrated on two aspects of obtaining
relative work values for the Medicare Fee Schedule. We have looked at
who should be surveyed for data to construct the RBRVS, how data
should be collected, and how data collected from diverse specialties
can be combined into a common scale of magnitude.

COLLECTING  DATA 

On the basis of our scrutiny of the Harvard study Phase II
examination of alternative data-collection methods, in conjunction
with our review of the psychological literature on individual- and
collective-based decisionmaking, we recommend that any future
magnitude estimation of work values be done using a group-based
method that provides intermediate feedback to group members to permit
them to adjust their individual numerical estimations. The two
leading candidates for such a method are (1) a multiple-round mail
survey with feedback on the distribution of responses between rounds
(i.e., a Delphi process) and (2) a discussion panel preceded by a
preliminary mail round. In terms of the validity of responses, we
have no evidence that suggests any major differences between methods.
The discussion method costs about 10 percent more than the Delphi
method, largely because of the travel costs required to assemble the
group.

With costs approximately equal and with no expectation that the two
group methods will differ in results, the choice between them can be
based on priorities not related to validity or expense. Because of
the political sensitivity of the MFS, the Delphi method may be
preferred because it surveys a larger respondent sample and thereby
permits greater representativeness among geographical, experience,
gender, racial, and other potential stratifying factors. Therefore,
for long-term updating of the RBRVS, the Delphi method may be the
preferred one. However, in response to short-term demands, where
answers need to arrive in timely fashion, a discussion panel is much
more easily assembled and is therefore probably the better option.

To answer the question of who should provide the data, the need for
physicians experienced in the procedures is logically unassailable.
However, the presence of experienced physicians need not imply the
absence of others; the leavening effect of primary care physicians or
other specialty representation lessens the opportunities for gaming
and helps keep measurements from different specialties aligned on a
common scale. Because of the potential for conflict of interest,
specialty societies should not have exclusive say in selecting
respondents to Delphi surveys or participants in discussion panels.
We recommend that the new HCFA universal provider file, from which
physicians with the necessary experience in the target services can
be identified, be used for these purposes. A possible role for the
specialty society representatives might be as consultative experts to
panels, discussing issues but not doing the ratings themselves. In
this way, their expertise informs the panels but does not preordain
the results.

CREATING  A  COMMON  SCALE 

Our examination of the Harvard study's linkage procedures has
unearthed a number of possible technical problems and conceptual
ambiguities. The technical problems, which include choice of
individual or group standard error as a weighting factor in linkage,
statistically eliminated links when biweights become zero, and some
apparent inconsistencies in link choices, are all easily fixed. The
majority of these problems do not involve changes of more than 5
percent in work values.

The conceptual ambiguities, however, call into question the validity
of the published RBRVS. Our preliminary analysis shows that an
alternative specification of the linkage procedure, which we believe
to be at least as consistent with the philosophy and intent of the
Harvard group's specification, produces adjustments that are more
than trivially different. Linkage is demonstrably sensitive to
underlying assumptions, and until it is clearly understood through
further investigation, it should not be used in future revisions of
the MFS.

However, given that a set of RVWs based on a common scale of
measurement has been published and that many of those work values
appear to be acceptable to the medical profession, the need for links
to move individual specialty work value estimates to a common scale
may be obviated. If a set of reference values that contains
frequently performed services across specialties and across the
continuum of work values can be validated, then that set can be
employed as a defining "ruler" for any future estimates of relative
work.

AN  OVERALL  VIEW 

Nobody claims that the RBRVS is perfect. It is not perfect now and it
will never be perfect. However, hardly anybody claims that the system
it replaces wasn't broken. The system of customary, reasonable, and
prevailing charges was too arbitrary and too uncertain to be relied
upon as the basis for a rational payment policy. The RBRVS, even with
all of its imperfections, represents an advance. Future work, even
work that causes major changes in the RBRVS, can similarly be viewed
as progress toward the objective of fair and consistent policies for
Medicare physician payment.


Appendix
SUMMARY OF RBRVS DEVELOPMENT